diff --git a/scraped_kb_articles/%25run-magic-command-not-working-as-expected-with-%25python-listed-first.json b/scraped_kb_articles/%25run-magic-command-not-working-as-expected-with-%25python-listed-first.json new file mode 100644 index 0000000000000000000000000000000000000000..f253bee1e482933eb022eecacb912877d755d34f --- /dev/null +++ b/scraped_kb_articles/%25run-magic-command-not-working-as-expected-with-%25python-listed-first.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/%25run-magic-command-not-working-as-expected-with-%25python-listed-first", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to use a\n%run\ncommand, you list it after\n%python\nin a notebook as in the following code snippet. You then notice the command behaves unexpectedly or inconsistently. However, when you run notebooks exported in file formats like\n.py\nor\n.ipynb\nfiles, they successfully run.\n%python\r\n%run\r\n...\nCause\nThe\n%run\ncommand must be the first line in a command cell to execute properly in Databricks notebooks.\nWhen it is not the first command, and a command like\n%python\nis written first, the IPython version of the\n%run\ncommand is called, leading to the observed different behavior from the Databricks version of this command.\nSolution\nEnsure that the\n%run\ncommand is the first line in the command cell when invoking a notebook from another notebook.\nFor more information about using\n%run\nto import a notebook, review the\nOrchestrate notebooks and modularize code in notebooks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+} \ No newline at end of file diff --git a/scraped_kb_articles/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector.json b/scraped_kb_articles/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector.json new file mode 100644 index 0000000000000000000000000000000000000000..7fe235dcbadb9fff524198cb7e968b4da847476f --- /dev/null +++ b/scraped_kb_articles/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector", + "title": "Unknown Article Title", + "content": "Problem\nYou want to ensure your Databricks account portal is inaccessible from the internet while still allowing a connection from Microsoft Entra ID to Databricks. You follow the steps to allowlist Microsoft Entra ID IPs as recommended in the\nTutorial: Develop and plan provisioning for a SCIM endpoint in Microsoft Entra ID\ndocumentation. The current requirement is to add all the Microsoft Entra ID IPs to the Databricks IP access list, so you can establish a connection from the Microsoft Entra ID service for SCIM. You have IP access control lists (ACLs) enabled at the account level.\nDuring this setup, you encounter a 403 error when you attempt to synchronize users through an Azure Databricks SCIM provisioning connector.\nCause\nThe IP addresses the Microsoft Entra ID service uses are dynamic. In order for Microsoft Entra ID SCIM provisioning to work with an IP ACL in place, the IP addresses with the AzureActiveDirectory tag need to be allowlisted. Their dynamic nature means it is difficult to maintain an updated IP ACL manually.\nSolution\nImplement an automation script to manage dynamic IP addresses. 
The script should do the following.\nRetrieve the current list of public IP address prefixes for Entra ID using the Python requests API.\nUpdate the Databricks IP access list using the accounts IP access lists API (using\nPATCH\n).\nBest practices while implementing your script include the following.\nRegularly download the updated IP ranges file from Microsoft and update the IP access list accordingly. This file is updated weekly, and new ranges will not be used in Azure for at least one week.\nTest the automation script rigorously in a lower environment before implementing it in production.\nTake a backup of the current IP access list before the first implementation.\nNote\nUpdating an IP access list requires the admin role.\nHow to run the script\nFirst, preview the changes that the script makes.\npython .py\nor\npython .py --dry-run=true\nThe script outputs a list of IP ranges that will be added to the Databricks IP access list, and a list of IP ranges to be removed from the Databricks IP access list.\nThen, apply the changes to the Databricks IP access list.\npython .py --dry-run=false\nExample script\nYou can use the following example script for\n.py\n. It sets constants, gets the latest service tags and address prefixes for Azure Active Directory. Then it gets the IP access list ID by name and updates the IP access list. 
Last, it determines changes and updates the IP access list with new prefixes.\nimport requests\r\nimport json\r\nimport re\r\nimport argparse\r\n\r\n# Constants\r\nMICROSOFT_DOWNLOAD_PAGE = \"https://www.microsoft.com/en-us/download/details.aspx?id=56519\"\r\naccount_id = ''\r\nDATABRICKS_INSTANCE = 'https://accounts.azuredatabricks.net/'\r\nDATABRICKS_TOKEN = ''\r\nIP_ACCESS_LIST_NAME = \"AzureActiveDirectory\"\r\n\r\n# Function to get the latest service tags JSON URL\r\ndef get_latest_service_tags_url():\r\n    headers = {\"User-Agent\": \"curl/7.81.0\", \"Accept\": \"*/*\"}\r\n    response = requests.get(MICROSOFT_DOWNLOAD_PAGE, headers=headers)\r\n    match = re.search(r'ServiceTags_Public_\\d+\\.json', response.text)\r\n    if not match:\r\n        raise ValueError(\"Could not find the latest ServiceTags_Public JSON file.\")\r\n    filename = match.group(0)\r\n    return f\"https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/{filename}\"\r\n\r\n# Function to get the address prefixes for Azure Active Directory\r\ndef get_azure_ad_prefixes(url):\r\n    response = requests.get(url)\r\n    data = response.json()\r\n    azure_ad_prefixes = []\r\n    for item in data['values']:\r\n        if item['name'] == \"AzureActiveDirectory\" and item['id'] == \"AzureActiveDirectory\":\r\n            azure_ad_prefixes = item['properties']['addressPrefixes']\r\n            azure_ad_prefixes = [prefix for prefix in azure_ad_prefixes if \":\" not in prefix]\r\n            break\r\n    return azure_ad_prefixes\r\n\r\n# Function to get the IP access list ID by name\r\ndef get_ip_access_list_id(access_list_name):\r\n    url = f\"{DATABRICKS_INSTANCE}/api/2.0/accounts/{account_id}/ip-access-lists\"\r\n    headers = {\"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\"}\r\n    response = requests.get(url, headers=headers)\r\n    response.raise_for_status()\r\n    access_lists = response.json()['ip_access_lists']\r\n    for acl in access_lists:\r\n        if 
acl['label'] == access_list_name:\r\n            return acl['list_id'], acl['ip_addresses']\r\n    raise ValueError(f\"Access list with label {access_list_name} not found.\")\r\n\r\n# Function to update the IP access list\r\ndef update_ip_access_list(access_list_id, ip_prefixes):\r\n    url = f\"{DATABRICKS_INSTANCE}/api/2.0/accounts/{account_id}/ip-access-lists/{access_list_id}\"\r\n    headers = {\r\n        \"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\",\r\n        \"Content-Type\": \"application/json\"\r\n    }\r\n    payload = {\r\n        \"label\": IP_ACCESS_LIST_NAME,\r\n        \"list_type\": \"ALLOW\",\r\n        \"ip_addresses\": ip_prefixes,\r\n        \"enabled\": True\r\n    }\r\n    response = requests.patch(url, headers=headers, data=json.dumps(payload))\r\n    response.raise_for_status()\r\n    return response.json()\r\n\r\n# Main script logic\r\ndef main(dry_run=True):\r\n    try:\r\n        # Get latest service tags URL\r\n        latest_url = get_latest_service_tags_url()\r\n        print(f\"Using Service Tags URL: {latest_url}\")\r\n\r\n        # Get Azure AD prefixes\r\n        azure_ad_prefixes = get_azure_ad_prefixes(latest_url)\r\n        \r\n        # Get the IP access list ID and current IPs for the given name\r\n        ip_access_list_id, current_ip_prefixes = get_ip_access_list_id(IP_ACCESS_LIST_NAME)\r\n        \r\n        # Determine changes\r\n        new_prefixes = set(azure_ad_prefixes) - set(current_ip_prefixes)\r\n        removed_prefixes = set(current_ip_prefixes) - set(azure_ad_prefixes)\r\n        \r\n        if dry_run:\r\n            print(\"Dry Run: Changes that will be made\")\r\n            print(\"New prefixes to add:\", new_prefixes)\r\n            print(\"Prefixes to remove:\", removed_prefixes)\r\n            return\r\n\r\n        # Update the IP access list with the new prefixes\r\n        updated_ips = list(set(azure_ad_prefixes))\r\n        response = update_ip_access_list(ip_access_list_id, updated_ips)\r\n     
   print(\"IP access list updated successfully:\", response)\r\n\r\n    except Exception as e:\r\n        print(f\"An error occurred: {e}\")\r\n\r\nif __name__ == \"__main__\":\r\n    \r\n    parser = argparse.ArgumentParser(description=\"Update Databricks IP Access List with Azure AD IPs\")\r\n    # Parse 'true'/'false' strings explicitly; type=bool would treat the string 'false' as True\r\n    parser.add_argument(\"--dry-run\", type=lambda s: s.lower() == \"true\", default=True, help=\"Perform a dry run to show changes without applying them (default: True)\")\r\n    args = parser.parse_args()\r\n    \r\n    main(dry_run=args.dry_run)\nFor more information, refer to the\nMicrosoft Entra on-premises application provisioning to SCIM-enabled apps\ndocumentation.\nFor quick access to the IP address list file with the AzureActiveDirectory tag, refer to the\nAzure IP Ranges and Service Tags – Public Cloud\nwebpage." +} \ No newline at end of file diff --git a/scraped_kb_articles/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command.json b/scraped_kb_articles/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command.json new file mode 100644 index 0000000000000000000000000000000000000000..7dfb12f6cd2ccebb9fd94a477ea30dade091252a --- /dev/null +++ b/scraped_kb_articles/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command", + "title": "Unknown Article Title", + "content": "Problem\nWhile attempting to query a shared table using the Delta Sharing client in a notebook, you try to execute the\ndelta_sharing.load_as_spark(table_url)\nmethod.\nYou receive a 404 error.\nrequests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://./delta_sharing/retrieve_config.html/shares\nCause\nThe endpoint URL used by the Delta Sharing client is incorrectly configured. 
It uses “\nHTTPS\n” in the path to the external location, which is not supported in the\nload_as_spark\nmethod.\nSolution\nCheck your endpoint URL and use the correct external location URI.\nConfirm that the endpoint URL in the share file matches the expected format. The call to the\ndelta_sharing.load_as_spark(table_url)\nmethod from the\ndelta-sharing\nPython library expects a valid URI that can be accessed remotely by the Spark driver, in the following format.\nAWS S3 (s3a://)\nAzure Blob Storage (abfss://)\nGCP GCS (gs://your-bucket/path/to/data)\nSelect and replace\n\nin the following code snippet with your URI, based on your cloud platform in the previous step.\nshare_file_path = ':///delta-sharing/share/open-datasets.share'\r\n\r\ntable_url = f\"{share_file_path}#delta_sharing.default.file_name\"\r\n\r\nshared_df = delta_sharing.load_as_spark(table_url)\r\n\r\ndisplay(shared_df)\nFor more information, refer to the\nDelta Sharing Receiver Quickstart\nnotebook." +} \ No newline at end of file diff --git a/scraped_kb_articles/404-error-when-installing-krb5-user-module.json b/scraped_kb_articles/404-error-when-installing-krb5-user-module.json new file mode 100644 index 0000000000000000000000000000000000000000..a5027af91683220fdd82c67813a2279bb1664498 --- /dev/null +++ b/scraped_kb_articles/404-error-when-installing-krb5-user-module.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/404-error-when-installing-krb5-user-module", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to install the\nkrb5-user\npackage on your cluster’s Ubuntu OS, you receive a 404 error.\nUnable to correct missing packages.\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libgssrpc4_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkadm5clnt-mit11_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed 
to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkdb5-9_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkadm5srv-mit11_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/k/krb5/krb5-user_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Aborting install.;\nCause\nWhen there is a package update in the\nkrb5-user Ubuntu repository\n, the previous version is removed.\nIn the Ubuntu OS, package references saved within\n/var/lib/apt/lists/\nare cached, causing the package info to point to the previous, now no longer available, version.\nNote\nThe command\napt-get clean\ndoesn't clear the\n/var/lib/apt/lists/\ndirectory. This is a common OS behavior.\nSolution\nManually remove the\n/var/lib/apt/lists/\npath to refresh the cached data.\nsudo rm -rf /var/lib/apt/lists/*\r\nsudo apt-get -y update\r\nsudo apt-get install -y libkrb5-user" +} \ No newline at end of file diff --git a/scraped_kb_articles/502-error-when-trying-to-access-the-spark-ui.json b/scraped_kb_articles/502-error-when-trying-to-access-the-spark-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..3405fda0ebd793be6343ef21701f3cfda6971601 --- /dev/null +++ b/scraped_kb_articles/502-error-when-trying-to-access-the-spark-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/502-error-when-trying-to-access-the-spark-ui", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to access the Apache Spark UI in Databricks, you encounter a 502 error even though your Spark job is running without issues.\n502 Bad Gateway: The server returned an invalid or incomplete response.\nThis error can occur in environments such as Data Engineering and Machine Learning, specifically when working with large Delta Lake tables.\nWhen you check the driver 
logs, you notice the following error.\njava.lang.StackOverflowError\nCause\nThe Spark UI stores data in memory by default. When a Spark job generates a large amount of data, that data can overflow memory, forcing the driver to address the overflow. The HTTP server, which is within the driver, cannot then respond to HTTP requests properly and throws a 502 error.\nSolution\nEnable the configuration to store Spark UI data on disk instead of in memory. This helps prevent the Spark UI from running out of memory.\nTo enable this configuration, add the following line to your Spark configuration in your cluster settings. Edit your cluster, scroll to\nAdvanced options > Spark\ntab, then in the\nSpark config\nfield, input the following setting.\nspark.ui.store.path /databricks/driver/sparkuirocksdb\nIf you prefer to enable the configuration using a notebook, you can use the following Python code.\nspark.conf.set(\"spark.ui.store.path\", \"/databricks/driver/sparkuirocksdb\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/abfs-client-hang.json b/scraped_kb_articles/abfs-client-hang.json new file mode 100644 index 0000000000000000000000000000000000000000..e13e233ec06c714c433be2e7a3459b14a0768aba --- /dev/null +++ b/scraped_kb_articles/abfs-client-hang.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/abfs-client-hang", + "title": "Unknown Article Title", + "content": "Problem\nYou are using Azure Data Lake Storage (ADLS) Gen2. 
When you try to access an Azure Blob File System (ABFS) path from a Databricks cluster, the command hangs.\nEnable the debug log and you can see the following stack trace in the driver logs:\nCaused by: java.io.IOException: Server returned HTTP response code: 400 for URL: https://login.microsoftonline.com/b9b831a9-6c10-40bf-86f3-489ed83c81e8/oauth2/token\r\n  at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1894)\r\n  at sun.net.www.protocol.http.HttpURLConnection.access$200(HttpURLConnection.java:91)\r\n  at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1484)\r\n  at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1482)\r\n  at java.security.AccessController.doPrivileged(Native Method)\r\n  at java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:782)\r\n  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1481)\r\n  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)\r\n  at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)\r\n  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator.getTokenSingleCall(AzureADAuthenticator.java:254)\r\n  ... 31 more\nCause\nIf ABFS is configured on a cluster with a wrong value for property\nfs.azure.account.oauth2.client.id\n, or if you try to access an explicit path of the form\nabfss://myContainer@myStorageAccount.dfs.core.windows.net/...\nwhere\nmyStorageAccount\ndoes not exist, then the ABFS driver ends up in a retry loop and becomes unresponsive. The command will eventually fail, but because it retries so many times, it appears to be a hung command.\nIf you try to access an incorrect path with an existing storage account, you will see a 404 error message. 
The system does not hang in this case.\nSolution\nYou must verify the accuracy of all credentials when accessing ABFS data. You must also verify the ABFS path you are trying to access exists. If either of these is incorrect, the problem occurs." +} \ No newline at end of file diff --git a/scraped_kb_articles/access-adls1-from-sparklyr.json b/scraped_kb_articles/access-adls1-from-sparklyr.json new file mode 100644 index 0000000000000000000000000000000000000000..da239e5a7af148ce3df073109ff811bc27c0600f --- /dev/null +++ b/scraped_kb_articles/access-adls1-from-sparklyr.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/access-adls1-from-sparklyr", + "title": "Unknown Article Title", + "content": "Problem\nWhen using a cluster with Azure AD Credential Passthrough enabled, commands that you run on that cluster are able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.\nFor example, you can directly access data using\n%python\r\n\r\nspark.read.csv(\"adl://myadlsfolder.azuredatalakestore.net/MyData.csv\").collect()\nHowever, when you try to access data directly using Sparklyr:\n%r\r\n\r\nspark_read_csv(sc, name = \"air\", path = \"adl://myadlsfolder.azuredatalakestore.net/MyData.csv\")\nIt fails with the error:\ncom.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token\nCause\nThe\nspark_read_csv\nfunction in Sparklyr is not able to extract the ADLS token to enable authentication and read data.\nSolution\nA workaround is to use an Azure\napplication id\n,\napplication key\n, and\ndirectory id\nto mount the ADLS location in DBFS:\n%python\r\n\r\n# Get credentials and ADLS URI from Azure\r\napplicationId= \r\napplicationKey= \r\ndirectoryId= \r\nadlURI=\r\nassert adlURI.startswith(\"adl:\"), \"Verify the adlURI variable is set and starts with adl:\"\r\n\r\n# Mount ADLS location to 
DBFS\r\ndbfsMountPoint=\r\ndbutils.fs.mount(\r\n  mount_point = dbfsMountPoint,\r\n  source = adlURI,\r\n  extra_configs = {\r\n    \"dfs.adls.oauth2.access.token.provider.type\": \"ClientCredential\",\r\n    \"dfs.adls.oauth2.client.id\": applicationId,\r\n    \"dfs.adls.oauth2.credential\": applicationKey,\r\n    \"dfs.adls.oauth2.refresh.url\": \"https://login.microsoftonline.com/{}/oauth2/token\".format(directoryId)\r\n  })\nThen, in your R code, read data using the mount point:\n%r\r\n\r\n# Install Sparklyr\r\n%r\r\ninstall.packages(\"sparklyr\")\r\nlibrary(sparklyr)\r\n# Create a sparklyr connection\r\nsc <- spark_connect(method = \"databricks\")\r\n\r\n# Read Data\r\n%r\r\nmyData = spark_read_csv(sc, name = \"air\", path = \"dbfs://myData.csv\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/access-blob-fails-wasb.json b/scraped_kb_articles/access-blob-fails-wasb.json new file mode 100644 index 0000000000000000000000000000000000000000..eba15706364b709669577fc7ec8798d890d1ec56 --- /dev/null +++ b/scraped_kb_articles/access-blob-fails-wasb.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/access-blob-fails-wasb", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to access an already created mount point or create a new mount point, it fails with the error:\nWASB: Fails with java.lang.NullPointerException\nCause\nThis error can occur when the root mount path (such as\n/mnt/\n) is also mounted to blob storage. Run the following command to check if the root path is also mounted:\n%python\r\n\r\ndbutils.fs.mounts()\nCheck if\n/mnt\nappears in the list.\nSolution\nUnmount the\n/mnt/\nmount point using the command:\n%python\r\n\r\ndbutils.fs.unmount(\"/mnt\")\nNow you should be able to access your existing mount points and create new ones."
+} \ No newline at end of file diff --git a/scraped_kb_articles/access-blobstore-odbc.json b/scraped_kb_articles/access-blobstore-odbc.json new file mode 100644 index 0000000000000000000000000000000000000000..af69d3772714ebc12a618913b31b397a27a43de0 --- /dev/null +++ b/scraped_kb_articles/access-blobstore-odbc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/access-blobstore-odbc", + "title": "Unknown Article Title", + "content": "Problem\nInfo\nIn general, you should use Databricks Runtime 5.2 and above, which include a built-in Azure Blob File System (ABFS) driver, when you want to access Azure Data Lake Storage Gen2 (ADLS Gen2). This article applies to users who are accessing ADLS Gen2 storage using JDBC/ODBC instead.\nWhen you run a SQL query from a JDBC or ODBC client to access ADLS Gen2, the following error occurs:\ncom.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: No value for dfs.adls.oauth2.access.token.provider found in conf file.\r\n\r\n18/10/23 21:03:28 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,\r\njava.util.concurrent.ExecutionException: java.io.IOException: There is no primary group for UGI (Basic token)chris.stevens+dbadmin (auth:SIMPLE)\r\n  at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)\r\n  at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)\r\n  at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)\r\n  at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)\r\n  at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)\r\n  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)\r\n  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)\r\n  at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)\r\n  at com.google.common.cache.LocalCache.get(LocalCache.java:3932)\r\n  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)\r\n  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:158)\r\n  at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:257)\r\n  at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:313)\r\n  at\r\n  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)\r\n  at scala.collection.immutable.List.foldLeft(List.scala:84)\r\n  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:87)\r\n  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:79)\nWhen you run the query from the SQL client, you get the following error:\nAn error occurred when executing the SQL command:\r\nselect * from test_databricks limit 50\r\n\r\n[Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: com.google.common.util.concurrent.UncheckedExecutionException: com.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /mnt/crm_gen2/phonecalls for resolving path '/phonecalls' within mount at '/mnt/crm_gen2'., Query: SELECT * FROM `default`.`test_databricks` `default_test_databricks` LIMIT 50. [SQL State=HY000, DB Errorcode=500051]\r\n\r\nWarnings:\r\n[Simba][SparkJDBCDriver](500100) Error getting table information from database.\nCause\nThe root cause is incorrect configuration settings to create a JDBC or ODBC connection to ABFS via ADLS Gen2, which cause queries to fail.\nSolution\nSet\nspark.hadoop.hive.server2.enable.doAs\nto\nfalse\nin the cluster configuration settings." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance.json b/scraped_kb_articles/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance.json new file mode 100644 index 0000000000000000000000000000000000000000..3e069e9841a274befd784b26faabcbac04e66241 --- /dev/null +++ b/scraped_kb_articles/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance", + "title": "Unknown Article Title", + "content": "Problem\nWhen you attempt to connect to SharePoint Online from an on-premises Databricks instance to pull content hosted in SharePoint Online, you receive an\nAcquire app-only access token failed\nerror.\nYou’re using Azure AD credentials for authentication, and the connection succeeds from your local computer but fails when attempted from Databricks.\nCause\nYour network is blocking outbound traffic to\nlogin.microsoftonline.com\non port 443. This domain is essential for authenticating with Azure AD and acquiring an app-only access token, which is required for accessing SharePoint resources.\nSolution\nConfirm network connectivity using the following command to verify if the proxy allows traffic to\nlogin.microsoftonline.com\non port 443.\n%sh\r\nnc -zv -x xxxxx:443 login.microsoftonline.com 443\nIf you encounter a disconnection error, work with your network/firewall team to ensure that access to\nlogin.microsoftonline.com\non port 443 is allowed from the on-premises Databricks environment."
+} \ No newline at end of file diff --git a/scraped_kb_articles/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api.json b/scraped_kb_articles/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api.json new file mode 100644 index 0000000000000000000000000000000000000000..2f8554d183172de99f587359e0778d930526b6fc --- /dev/null +++ b/scraped_kb_articles/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/active-vs-dead-jobs.json b/scraped_kb_articles/active-vs-dead-jobs.json new file mode 100644 index 0000000000000000000000000000000000000000..9271e9fd866e499dba5343b67c7183085a128f8c --- /dev/null +++ b/scraped_kb_articles/active-vs-dead-jobs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/active-vs-dead-jobs", + "title": "Unknown Article Title", + "content": "Problem\nOn clusters where there are too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This complicates identifying which are the active jobs/stages versus the dead jobs/stages.\nCause\nWhenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal\neventListenerBus\ndrops events. These events are used to track job progress in the Spark UI. Whenever the event listener drops events you start seeing dead jobs/stages in Spark UI, which never finish. 
The jobs are actually finished but not shown as completed in the Spark UI.\nYou observe the following traces in driver logs:\n18/01/25 06:37:32 WARN LiveListenerBus: Dropped 5044 SparkListenerEvents since Thu Jan 25 06:36:32 UTC 2018\nSolution\nThere is no way to remove dead jobs from the Spark UI without restarting the cluster. However, you can identify the active jobs and stages by running the following commands:\n%scala\r\n\r\nsc.statusTracker.getActiveJobIds()  // Returns an array containing the IDs of all active jobs.\r\nsc.statusTracker.getActiveStageIds() // Returns an array containing the IDs of all active stages." +} \ No newline at end of file diff --git a/scraped_kb_articles/add-custom-tags-to-a-delta-live-tables-pipeline.json b/scraped_kb_articles/add-custom-tags-to-a-delta-live-tables-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..4d2fefcb7ae36ad2ee98e7ea72ef52066b0e5d61 --- /dev/null +++ b/scraped_kb_articles/add-custom-tags-to-a-delta-live-tables-pipeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/add-custom-tags-to-a-delta-live-tables-pipeline", + "title": "Unknown Article Title", + "content": "When managing Delta Live Tables pipelines on your clusters, you may want to use custom tags for internal tracking. For example, you may want to use tags to allocate cost across different departments. Or your organization might have a global cluster policy that requires tags on the instances. Failure to comply with a cluster policy can result in cluster startup failures.\nInstructions\nThe Create Pipeline UI does not have an option to add additional tags. 
Instead, you must add custom tags manually by editing the JSON configuration.\nClick\nWorkflows\nin the left sidebar menu.\nClick\nDelta Live Tables\n.\nClick\nCreate Pipeline\n.\nEnter your pipeline information in the UI.\nClick\nJSON\nin the upper right to switch to the JSON view.\nAdd your custom tags to the\nclusters\nsection of the JSON file. The\ncustom_tags\nblock should be placed right below the\nlabel\nblock.\n{\r\n    \"clusters\": [\r\n        {\r\n            \"label\": \"default\",\r\n            \"custom_tags\": {\r\n                \"\": \"\",\r\n                \"\": \"\",\r\n                \"\": \"\"\r\n            },\r\n            \"autoscale\": {\r\n                \"min_workers\": 1,\r\n                \"max_workers\": 5,\r\n                \"mode\": \"ENHANCED\"\r\n            }\r\n        }\r\n    ],\r\n    \"development\": true,\r\n    \"continuous\": false,\r\n    \"channel\": \"CURRENT\",\r\n    \"edition\": \"ADVANCED\",\r\n    \"photon\": false,\r\n    \"libraries\": []\r\n}\nExample\nFor existing pipelines, open the Delta Live Tables UI and select the desired pipeline. Click the Settings tab and switch to the JSON view as shown above.\nInfo\nYou can add custom tags to an existing pipeline by editing the settings and using the JSON view to add the tag information. The JSON view can be used to add or edit any cluster property without using the UI."
+} \ No newline at end of file diff --git a/scraped_kb_articles/add-libraries-to-a-job-cluster-to-reduce-idle-time.json b/scraped_kb_articles/add-libraries-to-a-job-cluster-to-reduce-idle-time.json new file mode 100644 index 0000000000000000000000000000000000000000..e0743468a67e08dad9522bd6e7969e3015b1a420 --- /dev/null +++ b/scraped_kb_articles/add-libraries-to-a-job-cluster-to-reduce-idle-time.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/add-libraries-to-a-job-cluster-to-reduce-idle-time", + "title": "Unknown Article Title", + "content": "Problem:\nYou have an automated job that requires the use of external Maven libraries.\nYou created a separate cluster with the libraries installed, but it incurs idle time, resulting in unnecessary costs.\nSolution:\nTo add libraries to a job cluster, follow these steps:\nCreate a job in Databricks.\nClick\nAdd\nnext to dependent libraries.\nIn the pop-up window, add the required libraries.\nTo reduce idle time in a job cluster, you have two options:\nOpt out of auto termination by clearing the\nAuto Termination\ncheckbox.\nSpecify an inactivity period of 0.\nDatabricks recommends running jobs on a job cluster, rather than an interactive cluster with auto termination.\nJob clusters automatically terminate once the job completes, ensuring efficient resource utilization." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/addressing-performance-issues-with-over-partitioned-delta-tables.json b/scraped_kb_articles/addressing-performance-issues-with-over-partitioned-delta-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..8070d186f0e9307d4ae75109d5b32c5504fc7593 --- /dev/null +++ b/scraped_kb_articles/addressing-performance-issues-with-over-partitioned-delta-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/addressing-performance-issues-with-over-partitioned-delta-tables", + "title": "Unknown Article Title", + "content": "Info\nThis article applies to Databricks Runtime 15.2 and above.\nProblem\nWhen working with Delta tables, you notice that your\nDESCRIBE HISTORY\n,\nDESCRIBE FORMATTED\n, and\nDESCRIBE EXTENDED\nqueries execute slowly. You may also see bloated Delta logs or driver out-of-memory (OOM) errors.\nCause\nYour Delta tables are over-partitioned: you have less than 1 GB of data in a given partition, whether from a single file or multiple small files, even though each partition could hold more.\nWhen a Delta table is divided into too many partitions, each containing a small amount of data, the system's performance can degrade trying to manage the increased number of files and associated overhead.\nSolution\nImplement liquid clustering to simplify data layout decisions and optimize query performance. Liquid clustering helps distribute data more efficiently and reduce the overhead associated with managing a large number of small partitions.\nFor more information, please review the\nUse liquid clustering for Delta tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nYou can also optimize the table partitioning layout to ensure that each partition contains approximately 1 GB of data or more. 
Reduce the number of partitions and merge smaller files.\nFor more information on the ideal partition size, please refer to the\nWhen to partition tables on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/adls-gen1-firewall-access.json b/scraped_kb_articles/adls-gen1-firewall-access.json new file mode 100644 index 0000000000000000000000000000000000000000..ff0e0616bdb192c5f9546f2f70d0f4842cf478af --- /dev/null +++ b/scraped_kb_articles/adls-gen1-firewall-access.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/adls-gen1-firewall-access", + "title": "Unknown Article Title", + "content": "Problem\nWhen you have a firewall enabled on your Azure virtual network (VNet) and you try to access ADLS using the ADLS Gen1 connector, it fails with the error:\n328 format(target_id, \".\", name), value) 329 else: 330 raise Py4JError(Py4JJavaError:\r\nAn error occurred while calling o196.parquet.: java.lang.RuntimeException:\r\nCould not find ADLS Token at com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider$$anonfun$get Token$1.apply(AdlCredentialContextTokenProvider.scala:18)\r\nat com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider$$anonfun$get\r\nToken$1.apply(AdlCredentialContextTokenProvider.scala:18)\r\nat scala.Option.getOrElse(Option.scala:121)\r\nat com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider.getToken(AdlCredentialContextTokenProvider.scala:18)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getAccessToken(ADLStoreClient.java:1036)\r\nat com.microsoft.azure.datalake.store.HttpTransport.makeSingleCall(HttpTransport.java:177)\r\nat com.microsoft.azure.datalake.store.HttpTransport.makeCall(HttpTransport.java:91)\r\nat com.microsoft.azure.datalake.store.Core.getFileStatus(Core.java:655)\r\nat 
com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:735)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:718)\r\nat com.databricks.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:423)\r\nat org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)\r\nat org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:94)\nCause\nThis is a known issue with the ADLS Gen1 connector. Connecting to ADLS Gen1 when a firewall is enabled is unsupported.\nSolution\nUse\nADLS Gen2\ninstead." +} \ No newline at end of file diff --git a/scraped_kb_articles/adls-gen1-mount-problem.json b/scraped_kb_articles/adls-gen1-mount-problem.json new file mode 100644 index 0000000000000000000000000000000000000000..9f83a9958cedd278ada76ad0497ce33d47bc0270 --- /dev/null +++ b/scraped_kb_articles/adls-gen1-mount-problem.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/adls-gen1-mount-problem", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Databricks, it fails with the error:\ncom.microsoft.azure.datalake.store.ADLException: Error creating directory /\r\nError fetching access token\r\nOperation null failed with exception java.io.IOException : Server returned HTTP response code: 401 for URL: https://login.windows.net/18b0b5d6-b6eb-4f5d-964b-c03a6dfdeb22/oauth2/token\r\nLast encountered exception thrown after 5 tries. 
[java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException]\r\n [ServerRequestId:null]\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1169)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.createDirectory(ADLStoreClient.java:589)\r\nat com.databricks.adl.AdlFileSystem.mkdirs(AdlFileSystem.java:533)\r\nAt com.databricks.backend.daemon.data.client.DatabricksFileSystemV2$$anonfun$mkdirs$1$$anonfun$apply$mcZ$sp$7$$anonfun$apply$mcZ$sp$8.apply$mcZ$sp(DatabricksFileSystemV2.scala:638)\nCause\nThis error can occur if the ADLS Gen1 account was previously mounted in the workspace, but not unmounted, and the credential used for that mount subsequently expired. When you try to mount the same account with a new credential, there is a conflict between the expired and new credentials.\nSolution\nYou need to unmount all existing mounts, and then create a new mount with a new, unexpired credential.\nFor more information, see\nMount Azure Data Lake Storage Gen1 with DBFS (AWS)\nand\nMount Azure Data Lake Storage Gen1 with DBFS (Azure)\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables.json b/scraped_kb_articles/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..69a51c2b8ca924ff3891164c5c2a9f62b6865c64 --- /dev/null +++ b/scraped_kb_articles/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables", + "title": "Unknown Article Title", + "content": "Problem\nIt is common for JSON files to contain nested struct columns. 
Nested column names in a JSON file can have spaces between the names.\nWhen you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an\nAnalysisException\nerror message.\nFor example, if you try to read a JSON file, evaluate the DataFrame, and then write it out to a Delta table on DBR 10.2 or below, it returns an error.\n%scala\r\n\r\nval df = spark.read.json(\"\") \r\ndf.write.format(\"delta\").mode(\"overwrite\").save(\"\")\nThe expected error message is visible in the stack trace.\nAnalysisException: Attribute name \"stage_info.Accumulables.Count Failed Values\" contains invalid character(s) among \" ,;{}()\\n\\t=\". Please use alias to rename it.\nCause\nOne of the nested column names in the DataFrame contains spaces, which is preventing you from writing the output to the Delta table.\nSolution\nIf your source files are straightforward, you can use\nwithColumnRenamed\nto rename multiple columns and remove spaces. However, this can quickly get complicated with a nested schema.\nwithColumn\ncan be used to flatten nested columns and rename the existing column (with spaces) to a new column name (without spaces). In case of a large schema, flattening all of the nested columns in the DataFrame can be a tedious task.\nIf your clusters are using Databricks Runtime 10.2 or above, you can avoid the issue entirely by enabling column mapping mode. 
Column mapping mode allows the use of spaces as well as\n, ; { } ( ) \\n \\t =\ncharacters in table column names.\nSet the Delta table property\ndelta.columnMapping.mode\nto\nname\nto enable column mapping mode.\nThis sample code sets up a Delta table that can support nested column names with spaces; however, it does require a cluster running Databricks Runtime 10.2 or above.\n%scala\r\n\r\nimport io.delta.tables.DeltaTable\r\n\r\nval df = spark.read.json(\"\") \r\nDeltaTable.create()\r\n .addColumns(df.schema)\r\n .property(\"delta.minReaderVersion\", \"2\")\r\n .property(\"delta.minWriterVersion\", \"5\")\r\n .property(\"delta.columnMapping.mode\", \"name\")\r\n .location(\"\")\r\n .execute()\r\n\r\ndf.write.format(\"delta\").mode(\"append\").save(\"\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/alter-table-command-only-reordering-on-metadata-level.json b/scraped_kb_articles/alter-table-command-only-reordering-on-metadata-level.json new file mode 100644 index 0000000000000000000000000000000000000000..7428e8428e9acc5be3f08c7e0f13e8ec0f9531c2 --- /dev/null +++ b/scraped_kb_articles/alter-table-command-only-reordering-on-metadata-level.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/alter-table-command-only-reordering-on-metadata-level", + "title": "Unknown Article Title", + "content": "Problem\nYou have a pre-existing Delta table and want to add a new column,\n''\n. When you use the following ALTER TABLE command, the reordering is only adjusted at the metadata level.\nALTER TABLE ALTER COLUMN AFTER ;\nCause\nALTER TABLE\naffects the metadata level only. It does not change the physical layout of the parquet files underneath. 
Limiting its impact ensures that metadata-only changes to the schema, such as renaming or dropping columns, do not incur costly rewrites of the underlying data files.\nFor more information, refer to the\nUpdate Delta Lake table schema\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nTo rewrite files with the new column order taken into account, use either\nCREATE OR REPLACE TABLE\nor\nINSERT OVERWRITE TABLE\ninstead of\nALTER TABLE\n.\nCREATE OR REPLACE TABLE\nCREATE OR REPLACE TABLE
AS\r\nSELECT , , , ... -- desired column order\r\nFROM
\nINSERT OVERWRITE TABLE\nINSERT OVERWRITE TABLE
\r\nSELECT , , , ...\r\nFROM
" +} \ No newline at end of file diff --git a/scraped_kb_articles/alter-table-drop-partition-error-in-unity-catalog-external-tables.json b/scraped_kb_articles/alter-table-drop-partition-error-in-unity-catalog-external-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..01336f57ae2bb3b99204203a0277786ac729e72f --- /dev/null +++ b/scraped_kb_articles/alter-table-drop-partition-error-in-unity-catalog-external-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/alter-table-drop-partition-error-in-unity-catalog-external-tables", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to run\nALTER TABLE .. DROP PARTITION \non a Unity Catalog external table, you encounter an error.\nSQL query error : [UC_COMMAND_NOT_SUPPORTED.WITHOUT_RECOMMENDATION] ALTER TABLE (drop partition) are not supported in Unity Catalog.\nCause\nUnity Catalog does not store table partition information, so\nDROP PARTITION\nis not supported on Unity Catalog tables.\nSolution\nFor Unity Catalog external tables with CSV, JSON, ORC, or Parquet data formats, use partition metadata logging to resolve the issue.\nSet the\nspark.databricks.nonDelta.partitionLog.enabled\nproperty to\ntrue\nfor the Apache Spark session while creating the table.\nSET spark.databricks.nonDelta.partitionLog.enabled = true;\nRe-create or create the table.\nRe-run\nALTER TABLE .. DROP PARTITION \n.\nExample\nSet the config for partition metadata logging.\n%sql\r\n\r\nSET spark.databricks.nonDelta.partitionLog.enabled = true;\nCreate the table.\n%sql\r\n\r\nCREATE OR REPLACE TABLE ..\r\nUSING \r\nPARTITIONED BY ()\r\nLOCATION 's3:///';\nRe-run the command.\n%sql\r\n\r\nALTER TABLE .. DROP PARTITION \nFor more information, please refer to the\nPartition discovery for external tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/analysisexception-error-due-to-a-schema-mismatch.json b/scraped_kb_articles/analysisexception-error-due-to-a-schema-mismatch.json new file mode 100644 index 0000000000000000000000000000000000000000..4c90bca1d5aa046f9b74f3a76146702f6e24f1f2 --- /dev/null +++ b/scraped_kb_articles/analysisexception-error-due-to-a-schema-mismatch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/analysisexception-error-due-to-a-schema-mismatch", + "title": "Unknown Article Title", + "content": "Problem\nYou are writing to a Delta table when you get an\nAnalysisException\nerror indicating a schema mismatch.\n'AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: bc10as3e-e12va-4f325-av10e-4s38f17vr3dd3)'. input_df.write.format(\"delta\").mode(\"overwrite\").save(target_delta_table_path)\nCause\nThe schema mismatch is due to a difference between the source schema and the target schema. This usually happens when you introduce new columns to the target table during the write operation. This change in schema is a case of schema evolution. If these changes are expected, you should enable the\nmergeSchema\nproperty.\nSolution\nModify the write command and set the\nmergeSchema\nproperty to true.\ninput_df.write.format(\"delta\").mode(\"overwrite\").option(\"mergeSchema\", \"true\").save(target_delta_table_path)\nFor more information, please review the\nEnable schema evolution\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-.json b/scraped_kb_articles/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-.json new file mode 100644 index 0000000000000000000000000000000000000000..c3dbbe0a4234564ae2056bb648c69ce339cd8170 --- /dev/null +++ b/scraped_kb_articles/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to write DataFrames into OpenSearch indices using the\norg.opensearch.client:opensearch-spark-30_2.12\nlibrary when you encounter the error\nAnalysisException: Incompatible format detected\n.\nCause\nThis error is caused by the presence of a\n_delta_log\nfolder in the root directory of the Databricks File System (DBFS).\nWhen Apache Spark detects this folder, it assumes the target path is a Delta table, leading to the\nAnalysisException\nerror when attempting to write non-Delta data using the OpenSearch format. The folder's presence causes Spark to misinterpret the target path.\nSolution\nVerify the presence of the\n_delta_log\nfolder in the root directory.\ndbutils.fs.ls(\"dbfs:/_delta_log/\")\nIf the folder is present, delete it to remove the conflict. Ensure you take a backup if necessary before deletion.\ndbutils.fs.rm(\"dbfs:/_delta_log/\", True)\nAfter deleting the\n_delta_log\nfolder, re-run the job to write the DataFrame into OpenSearch using the specified function. Please adjust the parameters as required for your environment. 
The following is an example for an upsert condition in the us-east-1 region.\ndef saveToElasticSearch(df):\r\n  df.write.format(\"org.opensearch.spark.sql\")\\\r\n    .option(\"opensearch.nodes\", openSearchDomainPath)\\\r\n    .option(\"opensearch.nodes.wan.only\", True)\\\r\n    .option(\"opensearch.port\",\"443\")\\\r\n    .option(\"opensearch.net.ssl\", \"true\")\\\r\n    .option(\"opensearch.aws.sigv4.enabled\", \"true\")\\\r\n    .option(\"opensearch.aws.sigv4.region\", \"us-east-1\")\\\r\n    .option(\"opensearch.batch.size.entries\",200)\\\r\n    .option(\"opensearch.mapping.id\",\"id\")\\\r\n    .option(\"opensearch.write.operation\", \"upsert\")\\\r\n    .mode(\"append\")\\\r\n .save(index)\nConfirm that the job now completes successfully.\nNote\nAs a preventive measure, avoid creating Delta tables at the root location in DBFS to prevent similar conflicts in the future. Regularly check and clean up any unintended folders that may interfere with your workflows.\nFor more information, please review the\nWhat is Delta Lake?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/analyze-statement-not-working-on-delta-live-tables-dlt.json b/scraped_kb_articles/analyze-statement-not-working-on-delta-live-tables-dlt.json new file mode 100644 index 0000000000000000000000000000000000000000..751b9a20337340fc0063e5eaf88fd754ec3cfd3e --- /dev/null +++ b/scraped_kb_articles/analyze-statement-not-working-on-delta-live-tables-dlt.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/analyze-statement-not-working-on-delta-live-tables-dlt", + "title": "Unknown Article Title", + "content": "Problem\nWhen you use the Delta Live Tables (DLT) service to load data into streaming tables and then try to execute\nANALYZE\nstatements on Unity Catalog tables, you receive an error.\n\"PERMISSION_DENIED: MV/ST internal properties can only be updated through DLT. 
SQLSTATE: 42501\".\nCause\nANALYZE\nstatements are not supported for streaming tables.\nSolution\nWith Delta Live Tables, running\nANALYZE TABLE\nis not necessary. A DLT pipeline has a process that runs every 24 hours, called the maintenance job. This job performs optimization and vacuum on Delta tables that are part of the DLT pipeline." +} \ No newline at end of file diff --git a/scraped_kb_articles/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql.json b/scraped_kb_articles/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql.json new file mode 100644 index 0000000000000000000000000000000000000000..31dd57adba407f4ae653c7fe1ee5ae4294d95ba8 --- /dev/null +++ b/scraped_kb_articles/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/ansi-compliant-decimal-precision-and-scale.json b/scraped_kb_articles/ansi-compliant-decimal-precision-and-scale.json new file mode 100644 index 0000000000000000000000000000000000000000..ad230978cb849711d9c530b7399d9306d6e076d5 --- /dev/null +++ b/scraped_kb_articles/ansi-compliant-decimal-precision-and-scale.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/ansi-compliant-decimal-precision-and-scale", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to cast a value of one or greater as a\nDECIMAL\nusing equal values for both precision and scale. 
A null value is returned instead of the expected value.\nThis sample code:\n%sql\r\n\r\nSELECT CAST (5.345 AS DECIMAL(20,20))\nReturns:\nCause\nThe\nDECIMAL\ntype (\nAWS\n|\nAzure\n|\nGCP\n) is declared as\nDECIMAL(precision, scale)\n, where precision and scale are optional.\nPrecision represents the total number of digits in the value of the variable. This includes the whole number part and the fractional part.\nThe scale represents the number of fractional digits in the value of the variable. Put simply, this is the number of digits to the right of the decimal point.\nFor example, the number 123.45 has a\nprecision\nof five (as there are five total digits) and a\nscale\nof two (as only two digits are on the right-hand side of the decimal point).\nWhen the precision and scale are equal, it means the value is less than one, as all digits are used for the fractional part of the number. For example,\nDECIMAL(20, 20)\ndefines a value with 20 digits and 20 digits to the right of the decimal point. 
All 20 digits are used to represent the fractional part of the number, with no digits used for the whole number.\nIf the precision in the value overflows the precision defined in the datatype declaration, null is returned instead of the fractional decimal value.\nSolution\nSet\nspark.sql.ansi.enabled\nto\ntrue\nin your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nThis enables Spark SQL ANSI compliance.\nFor more information, review the\nANSI compliance\ndocumentation.\nInfo\nYou can also set this value at the notebook level using\nspark.conf.set(\"spark.sql.ansi.enabled\", \"True\")\nin a Python cell if you don't have the ability to edit the cluster's\nSpark config\n.\nOnce ANSI compliance is enabled, passing incorrect precision and scale values returns an error indicating the correct value.\nFor example, this sample code:\n%sql\r\n\r\nSELECT CAST (5.345 AS DECIMAL(20,20))\nReturns this error message:\nError in SQL statement: SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded, 5.345, 4, 3) cannot be represented as Decimal(20, 20).\nHowever, this sample code:\nSELECT CAST (5.345 AS DECIMAL(4,3))\nReturns the expected result:" +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-airflow-triggered-jobs-terminating-before-completing.json b/scraped_kb_articles/apache-airflow-triggered-jobs-terminating-before-completing.json new file mode 100644 index 0000000000000000000000000000000000000000..893d6e02f5c9f5a4791657d1d5e267c62455a4be --- /dev/null +++ b/scraped_kb_articles/apache-airflow-triggered-jobs-terminating-before-completing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/apache-airflow-triggered-jobs-terminating-before-completing", + "title": "Unknown Article Title", + "content": "Problem\nYou have a job triggered through Apache Airflow using the\nDatabricksRunNowOperator\nthat cancels after running for X hours (where X is the timeout value set through Airflow) even though 
the job is not complete.\nCause\nThe Airflow\nDatabricksRunNowOperator\noperator uses the X hours configuration to determine the job’s length and when to stop, regardless of whether the job is complete.\nIf you have access to audit logs, you can see the cancellation request is sent by the Airflow operator, confirming the issue lies in the Airflow configuration rather than the Databricks job settings.\nAudit logs snippet\n{\"version\":\"2.0\",\"auditLevel\":\"WORKSPACE_LEVEL\",\"timestamp\":1746505827573,\"orgId\":\"\",\"shardName\":\"\",\"accountId\":\"xxxxxxxxxxxxxxxx\",\"sourceIPAddress\":\"\",\"userAgent\":\"databricks-airflow/6.7.0 _/0.0.0 python/3.11.11 os/linux airflow/2.9.3+astro.11 operator/DatabricksRunNowOperator\",\"sessionId\":null,\"userIdentity\":{\"email\":\"\",\"subjectName\":null},\"principal\":{\"resourceName\":\"accounts/xxxxxxxxxxxxxxxx\"/users/\",\"uniqueName\":\"\",\"contextId\":\"\",\"displayName\":\"Data Engineering\"},\"authorizeAs\":{\"resourceName\":\"accounts/xxxxxxxxxxxxxxxx\"/users/\",\"uniqueName\":\"\",\"displayName\":\"Data Engineering\",\"activatingResourceName\":null},\"serviceName\":\"jobs\",\"actionName\":\"cancel\",\"requestId\":\"\",\"requestParams\":{\"run_id\":\"\"},\"response\":{\"statusCode\":200,\"errorMessage\":null,\"result\":\"{}\"}}\nNote\nIf you do not have audit logs configured for your workspace and you are on a premium plan or above, you can follow the instructions in the\nAudit log reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to configure them.\nSolution\nIncrease the job timeout threshold on the Airflow side.\nReview the Airflow Directed Acyclic Graph (DAG) that triggers the Databricks job. 
Look for the\nDatabricksRunNowOperator\ntask and check its configuration.\nAdjust the parameter in\nDatabricksRunNowOperator\nthat controls the timeout to a value beyond four hours.\nUpdate your Airflow DAG with the adjusted timeout parameter and deploy the changes.\nAfter updating the DAG, trigger a new run and monitor the job to ensure it runs beyond the previous four-hour limit without being terminated.\nFor more information, review the Airflow\nTasks\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-driver-failing-with-unexpected-stop-and-restart-message.json b/scraped_kb_articles/apache-spark-driver-failing-with-unexpected-stop-and-restart-message.json new file mode 100644 index 0000000000000000000000000000000000000000..7a6a0d7dde568eb81aeca3f680b9cb68891c73a4 --- /dev/null +++ b/scraped_kb_articles/apache-spark-driver-failing-with-unexpected-stop-and-restart-message.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/apache-spark-driver-failing-with-unexpected-stop-and-restart-message", + "title": "Unknown Article Title", + "content": "Problem\nTypically to accommodate a memory-intensive workload and avoid out-of-memory (OOM) errors, you scale up the cluster node’s memory.\nNote\nIf you’re looking for more information on scaling up, review the knowledge base article,\nSpark job fails with Driver is temporarily unavailable\n.\nAfter scaling up, you notice the driver still fails with an unexpected stop and restart message.\nThe spark driver has stopped unexpectedly and is restarting.\nWhile investigating, you notice a high frequency of garbage collection (GC) events which can be verified on the driver's log4j.\n24/11/07 00:32:45 WARN DBRDebuggerEventReporter: Driver/10.XX.XX.XX paused the JVM process 81 seconds during the past 120 seconds (67.76%) because of GC. 
We observed 3 such issue(s) since 2024-11-07T00:26:26.301Z.\nAdditionally, the driver's\nstdout\nmay show\nfull GC\nmessages.\n[Full GC (Metadata GC Threshold) [PSYoungGen: 38397K->0K(439808K)] [ParOldGen: 351239K->108115K(1019904K)] 389636K->108115K(1459712K), [Metaspace: 252946K->252852K(1290240K)], 46.2830875 secs] [Times: user=0.74 sys=0.76, real=46.28 secs]\nCause\nWhen scaling up a cluster’s memory doesn’t solve the memory issue, it rules out available memory and instead becomes a GC issue, which is often silent.\nGC causes the driver to pause Java virtual machine (JVM) applications. If there is more memory potentially available to use, the GC takes more time to scan all objects and free that memory. These long pauses can lead to a forced restart of the machine.\nSolution\nRefactor your code to use less memory at once. You can use the Apache Spark parallelization pipeline, which helps to distribute the load between the cluster's nodes and avoid memory issues.\nIf your workload doesn’t use the Spark API, Databricks recommends partitioning non-API execution types. Consider iterating over objects and reusing references instead of instantiating many heavy objects at the same time during processing." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway.json b/scraped_kb_articles/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway.json new file mode 100644 index 0000000000000000000000000000000000000000..cf6054f1a7933ff4b05a44bf0fcfa72cf46bd54b --- /dev/null +++ b/scraped_kb_articles/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway", + "title": "Unknown Article Title", + "content": "Problem\nYou receive\nINFO\nstatements despite configuring the Apache Spark settings to suppress\nINFO\nand give specific\nWARN\nstatements. This issue is observed even after setting the '\npy4j'\nlogger to\nWARN\nand configuring the logging in the Spark config in the Databricks UI to\nWARN\n.\nThe problem persists, leading to an overflow of\nINFO\nlogs, which can be problematic when integrating with monitoring tools like DataDog.\nCause\nConfiguring Spark settings to suppress\nINFO\nlogs does not override the default\nlog4j2\nsettings in the Databricks cluster, which control logging behavior at a more granular level. These default\nlog4j2\nsettings may still allow\nINFO\nlog generation.\nAdditionally, the integration with DataDog may not respect the Spark configuration settings, leading to the continued generation of\nINFO\nlogs.\nSolution\nModify the\nlog4j2\nconfiguration file directly within the Databricks environment.\n1. 
Use an init script that updates the\nlog4j2.xml\nfile to suppress\nINFO\nlogs.\n#!/bin/bash\r\nset -e  # Exit script on any error\r\n# Define the log4j2 configuration file path (modify if needed)\r\nLOG4J2_PATH=\"/databricks/spark/dbconf/log4j/driver/log4j2.xml\"\r\n# Modify the log4j2 configuration file\r\necho \"Updating log4j2.xml to suppress INFO logs\"\r\nsed -i 's/level=\"INFO\"/level=\"WARN\"/g' $LOG4J2_PATH\r\necho \"Completed log4j2 config changes at `date`\"\n2. Upload the init script to the\nWorkspace Files\n. (You can create a\n.sh\nfile in the workspace files folder, add the contents of the script to the\n.sh\nfile and use the init script on the cluster)\n3. Configure the cluster to use the init script by setting it in the\nInit Scripts\ntab.\n\"destination\": \"Workspace\"\r\n\"/Users//log4j_warn.sh\"\n4. Restart the cluster to apply the changes." +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error.json b/scraped_kb_articles/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error.json new file mode 100644 index 0000000000000000000000000000000000000000..e0d5e069bcfd195dfd4d7c9ee7a57e4b72496049 --- /dev/null +++ b/scraped_kb_articles/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to execute an Apache Spark job, it fails with a\nGC overhead limit exceeded\nerror.\njava.lang.OutOfMemoryError: GC overhead limit exceeded. 
The excessive GC triggered by join conditions results in the job being aborted with the error: Exception: Job aborted due to stage failure: Task in stage failed X times, most recent failure: Lost task in stage (TID) (executor): java.lang.OutOfMemoryError: GC overhead limit exceeded\nCause\nNULL values in join keys can result in a high number of unmatched rows, and a high number of unmatched rows in turn increases memory pressure. Duplicate records in join keys can increase the amount of data that needs to be shuffled and sorted during a sort-merge join.\nThese factors can lead to excessive memory usage and extended garbage collection (GC) activity, which ultimately triggers the\njava.lang.OutOfMemoryError: GC overhead limit exceeded\nerror.\nSolution\nFirst, analyze join columns. Check for NULL values and duplicates in your join keys to identify potential sources of the error.\nSELECT COUNT(*) FROM WHERE IS NULL OR IS NULL;\r\nSELECT COUNT(*) FROM WHERE IS NULL OR IS NULL;\nThen, deduplicate join keys. 
Minimize data size during joins by removing duplicate rows based on the join keys.\ndf1 = df1.dropDuplicates([\"1\", \"\"])\r\ndf2 = df2.dropDuplicates([\"\", \"\"])" +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute.json b/scraped_kb_articles/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..a17a71d736d4870f6fe9ba449609c5de2c0cf997 --- /dev/null +++ b/scraped_kb_articles/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to run a job on a dedicated compute, it fails with the following error.\nSparkException: Job aborted due to stage failure: Total size of serialized results of 2817 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.\nThe job fails even after increasing the\nspark.driver.maxResultSize\nand\ndriver memory\nto higher value.\nStacktrace\nat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\r\n\r\nat com.databricks.sql.execution.arrowcollect.RDDBatchCollector.runSparkJobs(RDDBatchCollector.scala:261)\r\n\r\nat com.databricks.sql.execution.arrowcollect.RDDBatchCollector.collect(RDDBatchCollector.scala:347)\r\n\r\nat com.databricks.sql.execution.arrowcollect.CloudStoreCollector$.hybridCollect(CloudStoreCollector.scala:159)\r\n\r\nat com.databricks.sql.execution.arrowcollect.CloudStoreCollector$.hybridCollect(CloudStoreCollector.scala:206)\r\n\r\nat org.apache.spark.sql.execution.qrc.CompressedHybridCloudStoreFormat.collect(cachedSparkResults.scala:170)\r\n\r\nat 
org.apache.spark.sql.execution.qrc.CompressedHybridCloudStoreFormat.collect(cachedSparkResults.scala:160)\r\n\r\nat org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.processAsRemoteBatches(SparkConnectPlanExecution.scala:475)\r\n\r\nat org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:141)\nCause\nWhen Fine-Grained Access Control (FGAC) is enabled, queries involving restricted data, such as those protected by row-level security, column masking, or secure views, are offloaded to serverless compute for enforcement. The resulting data must then be fully materialized and transferred back to the dedicated cluster’s driver.\nWhen the query spans a large number of small partitions, Apache Spark triggers an optimized execution path where executors send results directly to the driver, which aggregates and uploads them to cloud storage. If the total serialized result exceeds Spark’s internal 4 GiB driver-side limit, the job fails deterministically, regardless of driver memory or\nspark.driver.maxResultSize\nsettings.\nFor details, refer to the\nFine-grained access control on dedicated compute\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nExecute queries involving FGAC on standard compute, where data filtering and access control enforcement are handled within the same compute environment." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records.json b/scraped_kb_articles/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records.json new file mode 100644 index 0000000000000000000000000000000000000000..0373eaf6210b2ccff3b46f021b055e1f46c68fa3 --- /dev/null +++ b/scraped_kb_articles/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn your Apache Spark jobs, you notice some JSON files are processed incorrectly, leading to output containing only the first JSON object instead of all the records in the file.\nCause\nYou’re missing newline characters to separate each JSON record. Without them, Spark reads only the first JSON record from a file.\nExample of single-line JSON records without newline separators\n{\"col_1\":\"us-east-1\",\"col_3\":\"prod\"}{\"col_1\":\"us-east-2\",\"col_3\":\"dev\"}{\"col_1\":\"us-east-3\",\"col_3\":\"stage\"}\nExample of JSON records with newline separators\n{\"col_1\":\"us-east-1\",\"col_3\":\"prod\"}  \r\n{\"col_1\":\"us-east-2\",\"col_3\":\"dev\"}  \r\n{\"col_1\":\"us-east-3\",\"col_3\":\"stage\"}\nSolution\nAdd newline characters in your source file to separate each JSON object and make sure Spark can read them correctly.\nAlternatively, enable Photon runtime to leverage its ability to handle single-line JSON objects without newlines." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-jobs-fail-with-environment-directory-not-found-error.json b/scraped_kb_articles/apache-spark-jobs-fail-with-environment-directory-not-found-error.json new file mode 100644 index 0000000000000000000000000000000000000000..2aebf73206591c6bad9571e50bce548f1ae70af8 --- /dev/null +++ b/scraped_kb_articles/apache-spark-jobs-fail-with-environment-directory-not-found-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/apache-spark-jobs-fail-with-environment-directory-not-found-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAfter you install a Python library (via the cluster UI or by using\npip\n), your Apache Spark jobs fail with an\nEnvironment directory not found\nerror message.\norg.apache.spark.SparkException: Environment directory not found at\r\n/local_disk0/.ephemeral_nfs/cluster_libraries/python\nCause\nLibraries are installed on a Network File System (NFS) on the cluster's driver node. If any security group rules prevent the workers from communicating with the NFS server, Spark commands cannot resolve the Python executable path.\nSolution\nYou should make sure that your security groups are configured with appropriate security rules (\nAWS\n|\nAzure\n|\nGCP\n)." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster.json b/scraped_kb_articles/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..7e03c703f4639e55fd422802a4968badf195d40c --- /dev/null +++ b/scraped_kb_articles/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using spot instances in your cluster, your Apache Spark jobs fail due to stage failures.\n\"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.\"\nCause\nSpot instances can be preempted, leading to the loss of nodes in the cluster. When nodes are lost, the shuffle map stage fails and Spark cannot rollback the\nResultStage\nto re-process input data.\nSolution\nUse on-demand nodes instead of spot instances. In the cluster configuration, navigate to the\nAdvanced\ntab and slide the slider to the extreme right to select on-demand nodes for workers.\nAlternatively, if you want to continue to use spot instances, you can decrease the chance of data loss by enabling Spark decommissioning. Decommissioning allows migration of data before spot node preemption.\nImportant\nDecommissioning is a best effort and does not guarantee that all data can be migrated before final preemption. 
Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from the executor.\nTo decommission, add the following configurations to the cluster configuration under\nAdvanced options > Spark\n.\nspark.decommission.enabled true\r\nspark.storage.decommission.enabled true\r\nspark.storage.decommission.shuffleBlocks.enabled true\r\nspark.storage.decommission.rddBlocks.enabled true\r\n\r\nAdditionally, under Advanced options > Environment, add:\r\nSPARK_WORKER_OPTS=\"-Dspark.decommission.enabled=true\"" +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes.json b/scraped_kb_articles/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes.json new file mode 100644 index 0000000000000000000000000000000000000000..660fbecdb214f23c4aaf9ba354c01640ed98f6e5 --- /dev/null +++ b/scraped_kb_articles/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Apache Spark PySpark job using the following Python threading API function\nThreadPoolExecutor()\ntakes over an hour to complete, instead of minutes.\nwith ThreadPoolExecutor(max_workers=MAX_THREAD_NUM) as executor:\r\n    executor.map(thread_process_partition, cid_partitions)\nCause\nWhen using Python threads, the driver node becomes overwhelmed, leading to inefficient task distribution and underutilization of worker nodes.\nSolution\nUse the Databricks Spark connector instead of threading. For more information, review the\nConnect to external systems\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nEnsure that the cluster configuration is optimized for the workload. 
This includes adjusting the number of worker nodes and their specifications to match the job requirements. For cluster sizing guidance, review the\nCompute configuration recommendations\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nPreventative measures\nImplement best practices for Spark job optimization, such as caching intermediate results, using broadcast variables, and avoiding shuffles where possible. For more information, review the\nComprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads\n.\nMonitor and adjust the job's execution plan, using the Spark UI to identify and address any bottlenecks. Refer to the\nDebugging with the Apache Spark UI\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details." +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-submit-job-clusters-do-not-terminate-after-scstop.json b/scraped_kb_articles/apache-spark-submit-job-clusters-do-not-terminate-after-scstop.json new file mode 100644 index 0000000000000000000000000000000000000000..6c4a9c24827292ed3a68e345b8b7b3ba30c01c93 --- /dev/null +++ b/scraped_kb_articles/apache-spark-submit-job-clusters-do-not-terminate-after-scstop.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/apache-spark-submit-job-clusters-do-not-terminate-after-scstop", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Apache Spark Submit jobs remain active even after invoking\nsc.stop()\n, and the underlying job cluster does not shut down as expected.\nCause\nThe cluster still has non-daemon threads running, preventing shutdown.\nContext\nThe\nsc.stop()\ncommand in Spark stops the SparkContext, which is responsible for managing the resources for a Spark application. 
However, it does not necessarily terminate the job cluster.\nThe termination of a job cluster depends on various factors, including the specific configurations and the termination policies set for the cluster.\nIn Spark Submit jobs, the job will not exit until the Spark Submit JVM shuts down. This shutdown normally occurs in one of two ways.\nExplicitly, when the JVM’s\nSystem.exit()\nmethod is explicitly invoked somewhere in code.\nWhen all non-daemon threads have exited. A non-daemon thread ensures that the JVM waits for its completion before exiting, making it suitable for critical tasks that must be finished before the application terminates. This type of thread is crucial for maintaining data integrity and ensuring that important operations are fully executed.\nSome non-daemon threads can only be stopped when\nSparkContext.stop()\nis called.\nSome non-daemon threads are only cleaned up in a JVM shutdown hook.\nSolution\nTo ensure the Spark Submit job terminates properly, explicitly invoke\nSystem.exit(0)\nafter\nSparkContext.stop()\n.\nPython\nfrom pyspark.sql import SparkSession\r\nsc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain handle to SparkContext\r\nrunTheRestOfTheUserCode()\r\n# Fall through to exit with code 0 in case of success, since failure will throw an uncaught exception\r\n# and won't reach the exit(0) and thus will trigger a non-zero exit code that will be handled by\r\n# PythonRunner\r\nsc._gateway.jvm.System.exit(0)\nScala\ndef main(args: Array[String]): Unit = {\r\n  try {\r\n    runTheRestOfTheUserCode() // The actual application logic\r\n  } catch {\r\n    case t: Throwable =>\r\n      try {\r\n        // Log the throwable or error here\r\n      } finally {\r\n        System.exit(1)\r\n      }\r\n  }\r\n  System.exit(0)\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/apache-spark-ui-task-logs-intermittently-return-http-500-error.json 
b/scraped_kb_articles/apache-spark-ui-task-logs-intermittently-return-http-500-error.json new file mode 100644 index 0000000000000000000000000000000000000000..ee03451e09ea20d88d01a690b1cf3163a01d32ba --- /dev/null +++ b/scraped_kb_articles/apache-spark-ui-task-logs-intermittently-return-http-500-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/apache-spark-ui-task-logs-intermittently-return-http-500-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nUsers of Shared access mode clusters experience intermittent HTTP 500 errors when trying to view task logs in the Apache Spark UI. This also applies to admins.\nError\nCaused by: java.lang.Exception: Log viewing is disabled on this cluster\r\n    at org.apache.spark.deploy.worker.ui.LogPage.render(LogPage.scala:65)\r\n    at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:100)\r\n    at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:100)\r\n    at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)\r\n    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)\r\n    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\r\n    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)\r\n    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)\r\n    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\r\n    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\r\n    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\r\n    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)\r\n    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n   
 at org.eclipse.jetty.server.Server.handle(Server.java:534)\r\n    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\r\n    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\nCause\nThis specific exception is controlled by the\nspark.databricks.ui.logViewingEnabled\nSpark property. When this value is set to\nfalse\n, log viewing is disabled. When Spark log viewing is disabled on the cluster, the Spark UI generates an error when you attempt to view the logs.\nThe\nspark.databricks.ui.logViewingEnabled\nproperty defaults to\ntrue\n, however sometimes other Spark configurations (such as\nspark.databricks.acl.dfAclsEnabled\n) can alter its value and set it to\nfalse\n.\nSolution\nSet\nspark.databricks.ui.logViewingEnabled\nto\ntrue\nin the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nspark.databricks.ui.logViewingEnabled true\nThis restores the default configuration in case it is accidentally overwritten." +} \ No newline at end of file diff --git a/scraped_kb_articles/append-a-row-to-rdd-or-dataframe.json b/scraped_kb_articles/append-a-row-to-rdd-or-dataframe.json new file mode 100644 index 0000000000000000000000000000000000000000..927ec88693e3178f47ee16faab203ac108805bd9 --- /dev/null +++ b/scraped_kb_articles/append-a-row-to-rdd-or-dataframe.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/append-a-row-to-rdd-or-dataframe", + "title": "Título do Artigo Desconhecido", + "content": "To append to a DataFrame, use the\nunion\nmethod.\n%scala\r\n\r\nval firstDF = spark.range(3).toDF(\"myCol\")\r\nval newRow = Seq(20)\r\nval appended = firstDF.union(newRow.toDF())\r\ndisplay(appended)\n%python\r\n\r\nfirstDF = spark.range(3).toDF(\"myCol\")\r\nnewRow = spark.createDataFrame([[20]])\r\nappended = firstDF.union(newRow)\r\ndisplay(appended)" +} \ No newline at end of file diff --git a/scraped_kb_articles/append-output-not-supported-no-watermark.json 
b/scraped_kb_articles/append-output-not-supported-no-watermark.json new file mode 100644 index 0000000000000000000000000000000000000000..8537650b337d13e4a5d9134145bfae29ed0a99d4 --- /dev/null +++ b/scraped_kb_articles/append-output-not-supported-no-watermark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/append-output-not-supported-no-watermark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are performing an aggregation using append mode and an exception error message is returned.\nAppend output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark\nCause\nYou cannot use append mode on an aggregated DataFrame without a watermark. This is by design.\nSolution\nYou must apply a watermark to the DataFrame if you want to use append mode on an aggregated DataFrame.\nThe aggregation must have an event-time column, or a window on the event-time column.\nGroup the data by window and word and compute the count of each group.\n.withWatermark()\nmust be called on the same column as the timestamp column used in the aggregation. The example code shows how this can be done.\nReplace the value\n\nwith the type of element you are processing. 
For example, you would use Row if you are processing by row.\nReplace the value\n\nwith the streaming DataFrame of schema { timestamp: Timestamp, word: String }.\n%java\r\n\r\nDataset windowedCounts = \r\n    .withWatermark(\"timestamp\", \"10 minutes\")\r\n    .groupBy(\r\n        functions.window(words.col(\"timestamp\"), \"10 minutes\", \"5 minutes\"),\r\n        words.col(\"word\"))\r\n    .count();\n%python\r\n\r\nwindowedCounts = \\\r\n    .withWatermark(\"timestamp\", \"10 minutes\") \\\r\n    .groupBy(\r\n        window(words.timestamp, \"10 minutes\", \"5 minutes\"),\r\n        words.word) \\\r\n    .count()\n%scala\r\n\r\nimport spark.implicits._\r\n\r\nval windowedCounts = \r\n    .withWatermark(\"timestamp\", \"10 minutes\")\r\n    .groupBy(\r\n        window($\"timestamp\", \"10 minutes\", \"5 minutes\"),\r\n        $\"word\")\r\n    .count()\nYou must call\n.withWatermark()\nbefore you perform the aggregation. Attempting otherwise fails with an error message. For example,\ndf.groupBy(\"time\").count().withWatermark(\"time\", \"1 min\")\nreturns an exception.\nPlease refer to the Apache Spark documentation on\nconditions for watermarking to clean the aggregation state\nfor more information." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables.json b/scraped_kb_articles/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..d4e6cecf6bc9ed9e3023ce386acee1b68071a1b0 --- /dev/null +++ b/scraped_kb_articles/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to use\napplyInPandasWithState\nwith Delta Live Tables but execution fails with a\nModuleNotFoundError: No module named 'helpers'\nerror message.\nExample error\nTraceback (most recent call last):\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 1964, in main\r\n   func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 1770, in read_udfs\r\n   arg_offsets, f = read_single_udf(\r\n                    ^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 802, in read_single_udf\r\n   f, return_type = read_command(pickleSer, infile)\r\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker_util.py\", line 70, in read_command\r\n   command = serializer._read_with_length(file)\r\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/serializers.py\", line 196, in _read_with_length\r\n   raise SerializationError(\"Caused by \" + traceback.format_exc())\r\npyspark.serializers.SerializationError: Caused by Traceback (most recent call last):\r\n File 
\"/databricks/spark/python/pyspark/serializers.py\", line 192, in _read_with_length\r\n   return self.loads(obj)\r\n          ^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/serializers.py\", line 572, in loads\r\n   return cloudpickle.loads(obj, encoding=encoding)\r\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\nModuleNotFoundError: No module named 'helpers'\nCause\nApplyInPandasWithState\ndoes not work correctly when used with Delta Live Tables if you define the function you want to use outside of your notebook.\nIn this case, we are trying to use the\ncount_fn\nfunction that is defined in the\nhelpers.streaming.functions\nmodule. It is imported at the start of the example code block and then called as part of\napplyInPandasWithState\n. This results in an error.\nExample code\n%python\r\n\r\nimport pandas as pd\r\nfrom pyspark.sql.functions import col\r\nfrom pyspark.sql.streaming.state import GroupStateTimeout\r\nfrom helpers.streaming.functions import count_fn\r\nfrom pyspark.sql.functions import udf\r\n\r\ndf = (\r\n    spark.readStream.format(\"rate\")\r\n    .option(\"rowsPerSecond\", \"100\")\r\n    .load()\r\n    .withColumn(\"id\", col(\"value\"))\r\n    .groupby(\"id\")\r\n    .applyInPandasWithState(\r\n        func=count_fn,\r\n        outputStructType=\"id long, countAsString string\",\r\n        stateStructType=\"len long\",\r\n        outputMode=\"append\",\r\n        timeoutConf=GroupStateTimeout.NoTimeout,\r\n    )\r\n)\r\nimport dlt\r\nimport time\r\n\r\n@dlt.table(name=f\"random_{int(time.time())}\")\r\ndef a():\r\n  return df\nSolution\nYou should define the function you want to use within the notebook, reimporting the function you want to call as part of your custom function. Call the function you defined and it completes as expected.\nThis custom function imports\ncount_fn\nand runs it. 
By adding this to the sample code, and calling\nmy_func\ninstead of calling\ncount_fn\ndirectly, the example code successfully completes.\ndef my_func(*args):\r\n    from helpers.streaming.functions import count_fn\r\n    return count_fn(*args)\nExample code\n%python\r\n\r\nimport pandas as pd\r\nfrom pyspark.sql.functions import col\r\nfrom pyspark.sql.streaming.state import GroupStateTimeout\r\nfrom helpers.streaming.functions import count_fn\r\nfrom pyspark.sql.functions import udf\r\n\r\ndef my_func(*args):\r\n    from helpers.streaming.functions import count_fn\r\n    return count_fn(*args)\r\n\r\ndf = (\r\n    spark.readStream.format(\"rate\")\r\n    .option(\"rowsPerSecond\", \"100\")\r\n    .load()\r\n    .withColumn(\"id\", col(\"value\"))\r\n    .groupby(\"id\")\r\n    .applyInPandasWithState(\r\n        func=my_func,\r\n        outputStructType=\"id long, countAsString string\",\r\n        stateStructType=\"len long\",\r\n        outputMode=\"append\",\r\n        timeoutConf=GroupStateTimeout.NoTimeout,\r\n    )\r\n)\r\nimport dlt\r\nimport time\r\n\r\n@dlt.table(name=f\"random_{int(time.time())}\")\r\ndef a():\r\n  return df" +} \ No newline at end of file diff --git a/scraped_kb_articles/arcgis-library-installation-fails-with-subprocess-exited-with-error.json b/scraped_kb_articles/arcgis-library-installation-fails-with-subprocess-exited-with-error.json new file mode 100644 index 0000000000000000000000000000000000000000..be4625e541992711a0c54db69dff82c7764e4ebd --- /dev/null +++ b/scraped_kb_articles/arcgis-library-installation-fails-with-subprocess-exited-with-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/arcgis-library-installation-fails-with-subprocess-exited-with-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to pip install the ArcGIS library on a cluster and get the following error.\nline 159, in _get_build_requires\r\n          self.run_setup()\r\n        File 
\"/databricks/python/lib/python3.9/site-packages/setuptools/build_meta.py\", line 174, in run_setup\r\n          exec(compile(code, __file__, 'exec'), locals())\r\n        File \"setup.py\", line 109, in \r\n          link_args = shlex.split(get_output(f\"{kc} --libs gssapi\"))\r\n        File \"setup.py\", line 22, in get_output\r\n          res = subprocess.check_output(*args, shell=True, **kwargs)\r\n        File \"/usr/lib/python3.9/subprocess.py\", line 424, in check_output\r\n          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,\r\n        File \"/usr/lib/python3.9/subprocess.py\", line 528, in run\r\n          raise CalledProcessError(retcode, process.args,\r\n      subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127.\r\n      [end of output]\r\n  \r\n  note: This error originates from a subprocess, and is likely not a problem with pip.\r\nerror: subprocess-exited-with-error\r\n\r\n× Getting requirements to build wheel did not run successfully.\r\n│ exit code: 1\r\n╰─> See above for output.\nCause\nThe ArcGIS package requires native components that link against Kerberos (GSSAPI). During installation, the ArcGIS package invokes\nkrb5-config\n, which is missing from the Databricks environment.\nkrb5-config\nis provided by the\nlibkrb5-dev\npackage, and its absence causes the installation to fail with a\nsubprocess-exited-with-error\n.\nSolution\nFrom the UI, create a cluster-scoped init script to install the required\nlibkrb5-dev\nbinary package. The script is stored in your workspace filesystem, though you may optionally store it in S3 or a Unity Catalog volume.\nFrom your workspace landing page, navigate to your home folder.\nClick\nCreate\nin the top-right corner of the page. 
Then select\nFile\n.\nAdd the following script to the editor and save the file as\narcgis_requirements.sh\n#!/bin/bash\r\n\r\n# Install all the subdependencies packages required for ArcGIS\r\n# Remove all cached package lists to ensure a fresh update.\r\nsudo rm -rf /var/lib/apt/lists/*\r\n# Update the local package index to get the latest package information.\r\nsudo apt-get -y update\r\n\r\n# Install required packages for Kerberos\r\nsudo apt install -y libkrb5-dev\nNext, attach the init script to your cluster.\nNavigate to\nCompute\n> your cluster and click\nEdit\nto edit the cluster.\nExpand the\nAdvanced options\nand click the\nInit scripts\ntab.\nIn the\nSource\ndrop-down, select\nWorkspace\n.\nIn the\nFile\npath\n, select the path to the script\narcgis_requirements.sh\nClick\nAdd\n, then\nConfirm\nand restart\n.\nLast, install ArcGIS from the cluster UI.\nOn the cluster page, click\nLibraries\n. Then click\nInstall new\n.\nUnder\nLibrary Source\n, select\nPyPi\n.\nProvide the package name as\narcgis==\nClick\nInstall\n.\nFor additional information on init scripts, review the\nCluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor additional information on ArcGIS dependencies, refer to the esri Developer\nSystem requirements\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error.json b/scraped_kb_articles/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error.json new file mode 100644 index 0000000000000000000000000000000000000000..89a852e0f37d3440abceb33d145e7781c11a5024 --- /dev/null +++ b/scraped_kb_articles/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to connect to SharePoint from a Databricks notebook, you encounter a 404 'Not Found' error, even though you have the correct permissions and a valid folder path.\nThis error specifically occurs when trying to access a file from SharePoint, for example,\nhttps://xxxx.sharepoint.com/\n, suggesting that the issue is more likely related to connectivity or URL formatting rather than permissions.\nCause\nYour code uses URL-encoded characters (for example, %20 for spaces) in the file path.\nWhile URL encoding is typically correct for web URLs, in the specific case of SharePoint integration with Databricks, the SharePoint API expects native filesystem-style paths rather than web-encoded URLs when accessing documents through the\nFile.open_binary\nmethod.\nSolution\nRemove URL encoding for spaces (%20) from the path.\nUse regular spaces in the file path.\nMaintain the relative path structure as is.\nExample\nThe following code shows an example of %20 for spaces, which causes the error.\nresponse = File.open_binary(ctx, \"/sites/itfinance/Shared%20Documents/Evolve%20IT%20Finance/Database/...\")\nThe corrected code uses spaces instead of %20s.\nresponse = File.open_binary(ctx, \"/sites/itfinance/Shared
Documents/Evolve IT Finance/Database/...\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics.json b/scraped_kb_articles/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics.json new file mode 100644 index 0000000000000000000000000000000000000000..1b00d8d460b61f92b6c735841d6b329d9b41ea4f --- /dev/null +++ b/scraped_kb_articles/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to use a Python script to retrieve your serving endpoint metrics and save them to a file in the Prometheus format, when you encounter an error.\nAttributeError: 'ExportMetricsResponse' object has no attribute 'metrics'\r\nThe error suggests that the ExportMetricsResponse object returned by w.serving_endpoints.export_metrics(name=endpoint_name) does not contain an attribute called metrics.\nCause\nYou can request metrics from the serving endpoint API in the Prometheus format, but if you are using alternate methods (such as a Python script to retrieve and save the data to a file and then use it in another job) the JSON response is not guaranteed to be in the Prometheus format.\nSolution\nYou can use an example Python script as a base to retrieve the metrics data from a serving endpoint, convert it to Prometheus format, and write it out to a file.\nExample code\nThis Python script fetches the data from the\nserving endpoint metrics API\n(\nAWS\n|\nAzure\n|\nGCP\n) and writes the metrics out to a file,\nmetrics_output.prom\n.\nThe\nmetrics_output.prom\nfile can be read and referred to in any way you require. 
For example, you can ingest the response into\nsignalfx\n->\nprometheus-exporter\n.\nBefore running this example code, you need to replace:\n\n- URL of the workspace\n\n- Your PAT/OAuth token\n\n- The serving endpoint you want to collect metrics from\nimport requests\r\nfrom databricks.sdk import WorkspaceClient\r\n\r\n# Initialize Databricks client\r\nworkspace = WorkspaceClient(host=\"\", token=\"\")\r\n\r\n# Define the endpoint URL\r\nendpoint = \"/api/2.0/serving-endpoints//metrics\"\r\n\r\n# Set headers with correct Content-Type\r\nheaders = {\r\n\"Authorization\": f\"Bearer {workspace.config.token}\",\r\n\"Content-Type\": \"text/plain; version=0.0.4; charset=utf-8\"\r\n}\r\n\r\n# Make the GET request\r\nresponse = requests.get(endpoint, headers=headers)\r\n\r\n# Check if the request was successful\r\nif response.status_code == 200:\r\n# Save response to a file\r\nwith open(\"metrics_output.prom\", \"w\") as file:\r\nfile.write(response.text)\r\n\r\n# Print the contents of the file\r\nwith open(\"metrics_output.prom\", \"r\") as file:\r\nprint(file.read())\r\nelse:\r\nprint(f\"Error: {response.status_code}, {response.text}\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally.json b/scraped_kb_articles/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally.json new file mode 100644 index 0000000000000000000000000000000000000000..d681257c1092931ab30f749d1201063ae5756f71 --- /dev/null +++ b/scraped_kb_articles/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you save your subnet information, you store it locally. 
When you later try to retrieve the subnet information again, you encounter an authorization error.\nThis Azure storage request is not authorized. The storage account's 'Firewalls and virtual networks' settings may be blocking access to storage services. Please verify your Azure storage credentials or firewall exception settings.\nCause\nSubnet information is dynamic because it's serverless. Storing subnet information locally creates a static version, which becomes outdated when you try to use it later.\nSolution\nUse a script that leverages the Network Configuration Controller (NCC) API to fetch subnet information dynamically, and then allowlist the subnets. This approach integrates with serverless and automates Azure Data Lake Storage (ADLS) firewall allowlisting.\nCreate an NCC object.\nEndpoint:\nPOST https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/network-connectivity-configs\nPermission: Databricks Account Admin\nPayload:\nname (), region ()\nAttach the NCC object to one or more workspaces.\nEndpoint:\nPATCH https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/workspaces/{{}}\nPermission: Databricks Account Admin\nPayload:\nnetwork_connectivity_config_id ()\nObtain the NCC object details and record the subnet IDs.\nEndpoint:\nGET https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/workspaces/{{workspaceId}}/network-connectivity-configs\nPermission: Databricks Workspace Admin\nAdd the subnet IDs to the firewall of the Azure storage accounts.\nEndpoint: (Azure CLI command)\naz storage account network-rule add\nParameters:\n--resource-group, --account-name, --subscription, --subnet\nVerify connectivity.\nNo API endpoint required. Verify by creating a serverless cluster and accessing the desired storage accounts." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function.json b/scraped_kb_articles/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function.json new file mode 100644 index 0000000000000000000000000000000000000000..b27e4220ea1f5539a69cda5f9cfc0fc2a3cad114 --- /dev/null +++ b/scraped_kb_articles/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you upload files to a source location using an Azure function in Auto Loader, Auto Loader does not pick up the files for processing. The files are also not available in the queue, which Auto Loader sets automatically. The process does work, however, with manual intervention.\nCause\nAuto Loader listens for the ‘\nFlushWithClose\n’ event to process a file in file notification mode. The Azure function used for uploading files to a source location does not set the\nClose\nproperty in File Flush options, so EventBridge does not send the necessary event.\nSolution\nModify your Azure function to set the File Flush option with the\n'\nCLOSE\n' parameter.\nVerify that the\nFlushWithClose\nevent is generated in the Azure Queue after uploading the files to the source location.\nUse Directory listing mode until the file notification issue is resolved, then switch to file notification.\nMonitor the diagnostic logging on the storage Blob and Queue to identify any issues with the file upload process.\nFor more information, please review the\nWhat is Auto Loader file notification mode?\nand\nDataLakeFileFlushOptions.Close Property\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode.json b/scraped_kb_articles/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..75b55eedd85d6acf89dc6bce7233760fb4b3988b --- /dev/null +++ b/scraped_kb_articles/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou may encounter an issue where Auto Loader does not pick up new files in\ndirectory listing mode\n(\nAWS\n|\nAzure\n|\nGCP\n) in scenarios where the source\ncloudFiles\nfile naming convention has changed.\nCause\nThis is related to the way lexical ordering works when using directory listing mode in Auto Loader. New files with different naming conventions are not being recognized as new.\nExample\nThis Python example demonstrates a possible scenario that could occur.\n# List of sample filenames in source cloudfiles location\r\nfilenames = [\r\n    \"MYAPP_1970-01-01.parquet\",  # <-- older file\r\n    \"MYAPPX_1970-01-02.parquet\", # <-- newer file with slightly modified naming convention (added X character before the _)\r\n]\r\n# Sort the filenames lexicographically\r\nfor filename in sorted(filenames):\r\n    print(filename)\r\n#MYAPPX_1970-01-02.parquet # new file listed first (considered oldest)\r\n#MYAPP_1970-01-01.parquet  # old file listed last (considered newest)\nTo better understand how lexical ordering works, please review the\nLexical ordering of files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nUse file notification mode\nUse\nfile notification mode\n(\nAWS\n|\nAzure\n|\nGCP\n) instead of directory listing mode. 
File notification mode is lower-latency, can be more cost-effective, and helps avoid lexical ordering issues.\nDisable incremental listing\nIf you cannot use file notification mode, you should disable incremental listing by setting the Apache Spark option\ncloudFiles.useIncrementalListing\nto\nfalse\n.\nThis allows new files to be picked up, although it may increase the time spent listing files.\nNote\nIncremental listing mode is deprecated and should not be used. For more information, review the\nIncremental Listing (deprecated)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor more information and best practices on using Auto Loader, review the\nConfigure Auto Loader for production workloads\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files.json b/scraped_kb_articles/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files.json new file mode 100644 index 0000000000000000000000000000000000000000..9cbfd5cc7db3669a5d8a8174d9988a91399235cf --- /dev/null +++ b/scraped_kb_articles/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou notice that jobs which previously ran successfully start failing without any recent changes to the code or environment. You discover subsequently that pipelines using Auto Loader to ingest Delta changes for multiple tables fail suddenly with an error message,\njava.io.FileNotFoundException\n, for SST and log files.\nCause\nYour table’s checkpoint and Delta paths are misconfigured. When the\nVACUUM\ncommand is executed on the table, it deletes all files in the directory that are not tracked by\n_delta_log\n, including the checkpoint files. 
This can happen if the checkpoint path is set to the same location as the Delta path of the table.\nThe issue can also occur if multiple streams or jobs use the same checkpoint directory, which leads to conflicts and file deletions that can cause pipeline failures.\nSolution\nFirst, create a separate checkpoint folder outside of the Delta directory. This ensures that the checkpoint files are not deleted during the\nVACUUM\noperation.\nNext, for successful runs, copy the checkpoint files to this new path. For failed runs, start with a new checkpoint and use the\nmodifiedAfter\noption to the stream to ingest files that have a modification timestamp after a specific timestamp.\nFor more detail on the\nmodifiedAfter\noption, refer to the\nAuto Loader options\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nLast, run a\nVACUUM DRY RUN\nto validate which files will be deleted in the next vacuum execution. This helps ensure that no necessary files are deleted." +} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service.json b/scraped_kb_articles/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service.json new file mode 100644 index 0000000000000000000000000000000000000000..48d7d2a8aa28c655e5402f2dabdcebd6bb106d82 --- /dev/null +++ b/scraped_kb_articles/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Auto Loader in file notification mode, Auto Loader does not ingest your files to the target location as expected. 
This may happen even in situations where the referenced cloud queue service (AWS SQS, Azure Queue Storage, or Google Pub/Sub) contains messages with valid object storage URIs you expect Auto Loader to ingest. You receive\nWARN\nmessages in log4j.\nThe following is an example from AWS SQS.\nWARN S3Event: Ignoring unexpected message received in SQS queue\nExample configuration\nspark.readStream.format(\"cloudFiles\")\r\n  .option(\"cloudFiles.format\", \"json\")\r\n  .option(\"cloudFiles.useNotifications\", True)\r\n  .option(\"cloudFiles.queueUrl\", \"https://\")\r\n  .option(\"\",\"\")\r\n  ...\nCause\nMessages in your cloud queue service do not conform to the expected format. Messages that don’t comply with Auto Loader's expectations cause warnings like\nWARN S3Event: Ignoring unexpected message received in SQS queue\nfor AWS.\nSolution\nEnsure that your cloud queue messages conform to the expected format of the service consuming them.\nWhen using Auto Loader in AWS, the message format should comply with the\nObjectCreated\nevents in the AWS SQS Queue.\nAuto Loader on Azure expects messages that relate to the\nFlushWithClose\nevents.\nGCP expects messages that relate to the\nOBJECT_FINALIZE\nevent.\nIf the messages conform to the expected format, the Auto Loader stream should then progress as expected.  For further information on Auto Loader File Notification mode, refer to the\nWhat is Auto Loader file notification mode?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-streaming-job-failure-with-schema-inference-error.json b/scraped_kb_articles/auto-loader-streaming-job-failure-with-schema-inference-error.json new file mode 100644 index 0000000000000000000000000000000000000000..e03ee70eca436f57150dd2dbc0e5630a493f5867 --- /dev/null +++ b/scraped_kb_articles/auto-loader-streaming-job-failure-with-schema-inference-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/auto-loader-streaming-job-failure-with-schema-inference-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an Apache Spark streaming job using Auto Loader that encounters an error stating:\nSchema inference for the 'parquet' format from the existing files in the input path has failed\nCause\nOne possible cause for this issue is having multiple types of files in the child directories.\nThe input directory structure includes a root folder containing nested directories such as folder A and folder B, each containing various file formats.\nRoot Folder -> Folder A -> Folder B -> Avro files (*.avro)\r\nRoot Folder -> Folder A -> Folder C -> Parquet files (*.parquet)\nSolution\nTo selectively read a specific type of file using Auto Loader from a directory with diverse file formats, use the\npathGlobFilter\noption.\nFor example, you can use\n.option(\"pathGlobFilter\", \"*.parquet\")\nto set a suffix pattern for Parquet files, ensuring that only Parquet files are processed.\nFor more information, review the\nFiltering directories or files using glob patterns\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/auto-loader-streaming-query-failure-with-unknownfieldexception-error.json b/scraped_kb_articles/auto-loader-streaming-query-failure-with-unknownfieldexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..262c17897b149f2f1a9a88b845b0f56a98d6e081 --- /dev/null +++ b/scraped_kb_articles/auto-loader-streaming-query-failure-with-unknownfieldexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/auto-loader-streaming-query-failure-with-unknownfieldexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Auto Loader streaming job fails with an\nUnknownFieldException\nerror when a new column is added to the source file of the stream.\nException: org.apache.spark.sql.catalyst.util.UnknownFieldException: Encountered unknown field(s) during parsing: \nCause\nAn\nUnknownFieldException\nerror occurs when Auto Loader detects the addition of new columns as it processes incoming data.\nThe addition of a new column causes the stream to stop and generates an\nUnknownFieldException\nerror.\nSolution\nSet your Auto Loader stream to use schema evolution to avoid this issue.\nFor more information, review the\nHow does Auto Loader schema evolution work?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames.json b/scraped_kb_articles/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames.json new file mode 100644 index 0000000000000000000000000000000000000000..1f757672233635f2e556a84eba510fb380d45f4a --- /dev/null +++ b/scraped_kb_articles/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an Autoloader job configured in Directory listing mode and are encountering a failure with a\nURISyntaxException\nerror.\njava.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [masked_uri]\nCause\nThe error message indicates an issue with the URI (Uniform Resource Identifier) used in the Autoloader job configuration. This problem occurs when the source folder contains files with names containing colons (\"\n:\n\"). The Autoloader, in Directory listing mode, relies on the Hadoop library for file listing. Filenames with colons violate the library's naming limitations.\nFor more information on these limitations please review the\nHadoop Documentation\n.\nThe Apache community has acknowledged this issue in\nHDFS-14762\n.\nSolution\nYou have a few options to work around this naming limitation. 
Choose the most appropriate resolution based on your specific use case and requirements.\nAvoid filenames with colons:\nEnsure that filenames within the source path do not contain colons to comply with the Hadoop library naming constraints.\nSwitch to File notification mode:\nTransition from Directory listing mode to File notification mode in Autoloader.\nDisable incremental listing:\nIf Directory listing mode is required, disable incremental listing by setting\noption(\"cloudFiles.useIncrementalListing\", \"false\")\non\nreadStream\n. Note that this may degrade read performance.\nClear the checkpoint (temporary mitigation):\nIf the issue is infrequent, clearing the checkpoint may provide temporary relief. This should be considered a last resort, as it may result in duplicate processing in stateful streaming queries." +} \ No newline at end of file diff --git a/scraped_kb_articles/automate-vacuum-metrics-logging-for-delta-table-cleanup-audits.json b/scraped_kb_articles/automate-vacuum-metrics-logging-for-delta-table-cleanup-audits.json new file mode 100644 index 0000000000000000000000000000000000000000..5965cdd3d9afedaebe8d783e7ea086a653eedb64 --- /dev/null +++ b/scraped_kb_articles/automate-vacuum-metrics-logging-for-delta-table-cleanup-audits.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/automate-vacuum-metrics-logging-for-delta-table-cleanup-audits", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nAfter executing a\nVACUUM\noperation on a Delta table, you want to automate the extraction and logging of relevant metrics, such as the number of files deleted and the total size of data removed.\nInstructions\nParse the\noperationMetrics\nfrom the Delta transaction log using\nDESCRIBE HISTORY\n, and store this information in a dedicated Delta audit table for auditing or optimization validation purposes.\nFollow these steps to automate the logging of\nVACUUM\noperations performed in Delta tables.\nCreate a Delta audit 
table to store\nVACUUM\nmetrics like\nnumFilesToDelete\n,\nsizeOfDataToDelete\n, and\nnumDeletedFiles\n.\nRun the\nVACUUM\ncommand on the target Delta table.\nQuery\nDESC HISTORY\nfor\nVACUUM\noperations (\nVACUUM START\n,\nVACUUM END\n).\nTrack the maximum processed Delta version from the audit log.\nFilter new\nVACUUM\nentries using\nversion > max_logged_version\n.\nAppend only new rows to the audit table to avoid duplication.\nSample implementation snippet\nThe following code creates an audit table, performs the VACUUM operation on required tables, gets the max processed version, retrieves only the new history rows after the last logged version, selects desired audit fields, and writes the fields to the audit table.\n# Create audit table\r\nspark.sql(f\"\"\"\r\nCREATE TABLE IF NOT EXISTS .. (\r\ntable_name STRING,\r\nevent_time TIMESTAMP,\r\noperation STRING,\r\nuserName STRING,\r\nversion BIGINT,\r\nstatus STRING,\r\nnumFilesToDelete STRING,\r\nsizeOfDataToDelete STRING,\r\nnumDeletedFiles STRING\r\n) USING DELTA;\r\n\"\"\")\r\n\r\n\r\n\r\n# Perform the VACUUM operations on the required tables\r\nspark.sql(f\"VACUUM ..\")\r\n\r\n\r\n# Get max processed version\r\nlatest_logged_version = spark.sql(f\"\"\"\r\n  SELECT COALESCE(MAX(version), -1) as max_version\r\n  FROM ..\r\n  WHERE table_name = '..'\r\n\"\"\").collect()[0][\"max_version\"]\r\n\r\n# Get only new history rows after last logged version\r\nhistory_df = spark.sql(f\"\"\"\r\n  SELECT \r\n    timestamp AS event_time,\r\n    operation,\r\n    operationParameters,\r\n    operationMetrics,\r\n    userName,\r\n    version\r\n  FROM (DESC HISTORY `..`)\r\n  WHERE operation IN ('VACUUM START', 'VACUUM END')\r\n    AND version > {latest_logged_version}\r\n\"\"\")\r\n\r\n# Select desired audit fields\r\nmetrics_df = history_df.select(\r\n    lit(\"..\").alias(\"table_name\"),\r\n    col(\"event_time\"),\r\n    col(\"operation\"),\r\n    col(\"userName\"),\r\n    col(\"version\"),\r\n    
col(\"operationParameters\")[\"status\"].alias(\"status\"),\r\n    col(\"operationMetrics\")[\"numFilesToDelete\"].alias(\"numFilesToDelete\"),\r\n    col(\"operationMetrics\")[\"sizeOfDataToDelete\"].alias(\"sizeOfDataToDelete\"),\r\n    col(\"operationMetrics\")[\"numDeletedFiles\"].alias(\"numDeletedFiles\")\r\n)\r\n\r\n# Write to audit table\r\nmetrics_df.write.mode(\"append\").format(\"delta\").saveAsTable(\"..\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/automatic-vacuum-on-write-does-not-work-with-non-delta-tables.json b/scraped_kb_articles/automatic-vacuum-on-write-does-not-work-with-non-delta-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..700ad348543b3baa102cd03a188273e7c827fa72 --- /dev/null +++ b/scraped_kb_articles/automatic-vacuum-on-write-does-not-work-with-non-delta-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/automatic-vacuum-on-write-does-not-work-with-non-delta-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have non-Delta tables and Databricks is not running\nVACUUM\nautomatically. Symptoms may include the presence of uncommitted files older than the retention threshold. Tables should be vacuumed on the write operation, but uncommitted files are not removed as expected.\nCause\nWhen using non-delta tables,\nVACUUM\nautomatically runs at the end of every job and only cleans directories that the particular Apache Spark job touches. If an operation runs on a specific partition,\nVACUUM\nonly affects that partition directory, rather than the whole table.\nSolution\nYou should manually run\nVACUUM\nto clear uncommitted files from the entire table.\nIdentify the table and partitions that contain dirty data.\nRun a manual\nVACUUM\non the entire table to remove uncommitted files that are older than the retention threshold. 
The default threshold is 7 days, but it can be adjusted as needed.\nVACUUM [table_name] RETAIN [number] HOURS;\nFor example, to\nVACUUM\na table named\n.\nand retain files for 1 hour, use:\nVACUUM . RETAIN 1 HOURS;\nFor more information, please review the\nVACUUM\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/autotermination-disabled-error-creating-job.json b/scraped_kb_articles/autotermination-disabled-error-creating-job.json new file mode 100644 index 0000000000000000000000000000000000000000..f22ef2c1c29996c037756439d84ec1988acc6aa5 --- /dev/null +++ b/scraped_kb_articles/autotermination-disabled-error-creating-job.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/autotermination-disabled-error-creating-job", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to start a job cluster, but the job creation fails with an error message.\nError creating job\r\nCluster autotermination is currently disabled.\nCause\nJob clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination policies.\nIf you include\nautotermination_minutes\nin your cluster policy JSON, you get the error on job creation.\n{\r\n \"autotermination_minutes\": {\r\n  \"type\": \"fixed\",\r\n   \"value\": 30,\r\n   \"hidden\": true\r\n  }\r\n}\nSolution\nDo not define\nautotermination_minutes\nin the cluster policy for job clusters.\nAuto termination should only be used for all-purpose clusters." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-container-does-not-exist-error-when-creating-files-in-a-workspace.json b/scraped_kb_articles/azure-container-does-not-exist-error-when-creating-files-in-a-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..f4e9c8720cbf566681d13a84fd2d7159ff4c38c3 --- /dev/null +++ b/scraped_kb_articles/azure-container-does-not-exist-error-when-creating-files-in-a-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/azure-container-does-not-exist-error-when-creating-files-in-a-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to create files in a Databricks workspace, you receive the error:\n'Request failed for POST //. Cause: The Azure container does not exist'.\nNote\nThis issue occurs in the Azure Databricks environment and affects the ability to create normal files, but\nnot\nthe ability to create notebooks.\nCause\nThere are three possible causes.\nThe Azure container might not have been properly created when the workspace was provisioned.\nThe Azure container might have been deleted or moved after it was created.\nThere might be a problem with the permissions or roles assigned to the user or service principal trying to access the Azure container.\nSolution\nFirst, confirm the Azure container exists in the specified resource group and subscription. If it has been deleted or moved, recreate it in the correct location.\nThen ensure the user or service principal has the necessary permissions to access and create files in the Azure container. Adjust the roles and permissions as needed.\nIf the workspace is newly deployed and the issue persists, consider deleting and recreating the workspace. This can resolve issues caused by errors during the initial provisioning process." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-core-limit.json b/scraped_kb_articles/azure-core-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..6bed2f676f9f4ff4315633a6815c5e2a98aaa70d --- /dev/null +++ b/scraped_kb_articles/azure-core-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/azure-core-limit", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nCluster creation fails with a message about a cloud provider error when you hover over cluster state.\nCloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.\nWhen you\nview the cluster event log\nto get more details, you see a message about core quota limits.\nOperation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 350, Additional requested: 4.\nCause\nAzure subscriptions have a CPU core quota limit which restricts the number of CPU cores you can use. This is a hard limit. If you try to start a cluster that would result in your account exceeding the CPU core quota, the cluster launch will fail.\nSolution\nYou can either free up resources or request a quota increase for your account.\nStop inactive clusters to free up CPU cores for use.\nOpen an Azure support case with a request to increase the CPU core quota limit for your subscription." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-ip-limit.json b/scraped_kb_articles/azure-ip-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..f0e86cb46b23e67bd27a3c105a10e0a81ddedbc9 --- /dev/null +++ b/scraped_kb_articles/azure-ip-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/azure-ip-limit", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nCluster creation fails with a message about a cloud provider error when you hover over cluster state.\nCloud Provider Launch Failure: A cloud provider error was encountered while setting up the cluster.\nWhen you\nview the cluster event log\nto get more details, you see a message about\npublicIPAddresses\nlimits.\nResourceQuotaExceeded Azure error message: Creating the resource of type 'Microsoft.Network/publicIPAddresses' would exceed the quota of '800' resources of type 'Microsoft.Network/publicIPAddresses' per resource group. The current resource count is '800', please delete some resources of this type before creating a new one.'\nYou may also see an error about core address limits.\nOperation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 350, Additional requested: 4.\nCause\nAzure subscriptions have a public IP address limit which restricts the number of public IP addresses you can use. This is a hard limit. 
If you try to start a cluster that would result in your account exceeding the public IP address quota, the cluster launch will fail.\nSolution\nYou can either free up resources or request a quota increase for your account.\nStop inactive clusters to free up public IP addresses for use.\nOpen an Azure support case with a request to increase the quota limit for your subscription.\nFor more information, please review the following documentation:\nView quotas\nIncrease compute quotas:\nVM-family vCPU quotas\n,\nvCPU quotas by region\n, and\nspot vCPU quotas\n.\nIncrease networking quotas" +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-nodes-not-acquired.json b/scraped_kb_articles/azure-nodes-not-acquired.json new file mode 100644 index 0000000000000000000000000000000000000000..15890884e998660f68a1ad5dbae8fdb0741e05e3 --- /dev/null +++ b/scraped_kb_articles/azure-nodes-not-acquired.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/azure-nodes-not-acquired", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA cluster takes a long time to launch and displays an error message similar to the following:\nCluster is running but X nodes could not be acquired\nCause\nProvisioning an Azure VM typically takes 2-4 minutes, but if all the VMs in a cluster cannot be provisioned at the same time, cluster creation can be delayed. This is due to Azure Databricks having to reissue VM creation requests over a period of time.\nSolution\nIf a cluster launches without all of the nodes, Azure Databricks automatically tries to acquire the additional nodes and will update the cluster once available.\nTo work around this, you should configure a cluster with a bigger instance type and a smaller number of nodes." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-spark-shuffle-fetch-fail.json b/scraped_kb_articles/azure-spark-shuffle-fetch-fail.json new file mode 100644 index 0000000000000000000000000000000000000000..ca8ca65c676aab7ffdf462b29a154b1bed28ec20 --- /dev/null +++ b/scraped_kb_articles/azure-spark-shuffle-fetch-fail.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/azure-spark-shuffle-fetch-fail", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are seeing intermittent Apache Spark job failures on jobs using shuffle fetch.\n21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message=\r\norg.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.79.1.134:4048\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:553)\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:484)\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:63)\r\n... 1 more\r\nCaused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.79.1.134:4048\r\nat sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\nCause\nThis can happen if you have modified the Azure Databricks subnet CIDR range after deployment. 
This behavior is not supported.\nAssume the following details describe two scenarios:\nOriginal Azure Databricks subnet CIDR\nPrivate subnet: 10.10.0.0/24 (10.10.0.0 - 10.10.0.255)\nPublic subnet: 10.10.1.0/24 (10.10.1.0 - 10.10.1.255)\nModified Azure Databricks subnet CIDR\nPrivate subnet: 10.10.0.0/18 (10.10.0.0 - 10.10.63.255)\nPublic subnet: 10.10.64.0/18 (10.10.64.0 - 10.10.127.255)\nWith the original settings, everything works as intended.\nWith the modified settings, if executors are assigned IP addresses in the subnet range 10.10.1.0 - 10.10.63.255 and the driver is assigned an IP address in the subnet range 10.10.0.0 - 10.10.0.255, the communication between executors is blocked due to a firewall rule limiting communication in the original CIDR range of 10.10.0.0/24.\nIf the executors and driver are both assigned IP addresses in 10.10.0.0/24, no communication is blocked and the job runs as intended. However, this assignment is not guaranteed under the modified settings.\nSolution\nRevert any subnet CIDR changes and restore the original VNet configuration that you used to create the Azure Databricks workspace.\nRestart your cluster.\nResubmit your jobs." +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-ssh-cluster-driver-node.json b/scraped_kb_articles/azure-ssh-cluster-driver-node.json new file mode 100644 index 0000000000000000000000000000000000000000..9c892bd07166f8a5cdae3ccea0ce23d17048d47f --- /dev/null +++ b/scraped_kb_articles/azure-ssh-cluster-driver-node.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/azure-ssh-cluster-driver-node", + "title": "Título do Artigo Desconhecido", + "content": "This article explains how to use SSH to connect to an Apache Spark driver node for advanced troubleshooting and installing custom software.\nWarning\nYou can only use SSH if your workspace is deployed in an Azure Virtual Network (VNet) under your control.
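The blocked-traffic scenario in the shuffle-fetch article above can be checked with the standard-library `ipaddress` module. The driver/executor addresses come from the article's error log and scenario; only the helper function is an illustration.

```python
import ipaddress

# Why the modified CIDR breaks shuffle traffic: a firewall rule scoped to
# the ORIGINAL private subnet (10.10.0.0/24) only permits traffic between
# endpoints inside that range, while the widened subnet (10.10.0.0/18)
# can hand executors addresses outside it.
original_private = ipaddress.ip_network("10.10.0.0/24")
modified_private = ipaddress.ip_network("10.10.0.0/18")

driver = ipaddress.ip_address("10.10.0.45")     # inside the original /24
executor = ipaddress.ip_address("10.10.1.134")  # only inside the widened /18

def shuffle_allowed(a, b, allowed_net):
    """Traffic passes only if BOTH endpoints sit inside the allowed network."""
    return a in allowed_net and b in allowed_net

print(shuffle_allowed(driver, executor, original_private))  # False -> FetchFailed
```

The executor at 10.10.1.134 (the address in the `FetchFailedException`) is valid in the widened subnet but outside the firewall's original /24 range, which is exactly the intermittent "No route to host" failure the article describes.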
If your workspace is NOT VNet injected, the SSH option will not appear. Additionally, NPIP workspaces do not support SSH.\nConfigure an Azure network security group\nThe network security group associated with your VNet must allow SSH traffic. The default port for SSH is 2200. If you are using a custom port, you should make note of it before proceeding. You also have to identify a traffic source. This can be a single IP address, or it can be an IP range that represents your entire office.\nIn the Azure portal, find the network security group. The network security group name can be found in the public subnet.\nEdit the inbound security rules to allow connections to the SSH port. In this example, we are using the default port.\nInfo\nMake sure that your computer and office firewall rules allow you to send TCP traffic on the port you are using for SSH. If the SSH port is blocked at your computer or office firewall, you cannot connect to the Azure VNet via SSH.\nGenerate SSH key pair\nOpen a local terminal.\nCreate an SSH key pair by running this command:\nssh-keygen -t rsa -b 4096 -C\nInfo\nYou must provide the path to the directory where you want to save the public and private key.
The public key is saved with the extension .pub.\nConfigure a new cluster with your public key\nCopy the ENTIRE contents of the public key file.\nOpen the cluster configuration page.\nClick\nAdvanced Options\n.\nClick the\nSSH\ntab.\nPaste the ENTIRE contents of the public key into the\nPublic key\nfield.\nContinue with cluster configuration as normal.\nConfigure an existing cluster with your public key\nIf you have an existing cluster and did not provide the public key during cluster creation, you can inject the public key from a notebook.\nOpen any notebook that is attached to the cluster.\nCopy the following code into the notebook, updating it with your public key as noted:\n%scala\r\n\r\nval publicKey = \"\"\r\n\r\ndef addAuthorizedPublicKey(key: String): Unit = {\r\n  val fw = new java.io.FileWriter(\"/home/ubuntu/.ssh/authorized_keys\", /* append */ true)\r\n  fw.write(\"\\n\" + key)\r\n  fw.close()\r\n}\r\naddAuthorizedPublicKey(publicKey)\nRun the code block to inject the public key.\nSSH into the Spark driver\nOpen the cluster configuration page.\nClick\nAdvanced Options\n.\nClick the\nSSH\ntab.\nNote the\nDriver Hostname\n.\nOpen a local terminal.\nRun the following command, replacing the hostname and private key file path:\nssh ubuntu@ -p 2200 -i " +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-storage-gives-error-authorizationfailure-while-using-mlflow.json b/scraped_kb_articles/azure-storage-gives-error-authorizationfailure-while-using-mlflow.json new file mode 100644 index 0000000000000000000000000000000000000000..270c5166f75f1b2399ca873a34bf0ecdf7b0a79c --- /dev/null +++ b/scraped_kb_articles/azure-storage-gives-error-authorizationfailure-while-using-mlflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/azure-storage-gives-error-authorizationfailure-while-using-mlflow", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn Azure Databricks, you can store a model registered in 
Unity Catalog within an external storage account. Databricks provides a hosted version of the MLflow Model Registry as part of Unity Catalog and you can load models from this registry using the MLflow client.\nHowever, when the storage account uses a Private Link connection, you receive an authorization error while loading the model.\nExample\nIn the following Python code example, a model is already registered in Unity Catalog with the alias\nModel_Alias\n. After importing the MLflow library, the client is configured to access models in Unity Catalog and load the model using the\nmlflow.pyfunc.load_model()\nfunction.\nimport mlflow\r\nmlflow.set_registry_uri(\"databricks-uc\")  #configure MLflow to access models in Unity Catalog\r\nmodel_version_uri = \"models:/catalog.schema.model_name@Model_Alias\"\r\nmodel_version = mlflow.pyfunc.load_model(model_version_uri)  #load the model from registry\nThis results in the following error.\nRequestId: 83eacd9-d01e-00c-20ef-25a202000000\r\nTime: 2024-10-24T05:50:08.7165446Z\r\n2024/10/24 05:50:08 INFO mflow.store.artifact.cloud_artifact_repo: Failed to complete request, possibly due to credential expiration. Refreshing credentials and trying again…(Error: This request is not authorized to perform this operation.) \r\nErrorCode: AuthorizationFailure\nCause\nThe Azure Storage account only has a DFS private endpoint, but both DFS and Blob storage endpoints are required for proper authorization.\nSolution\nAdd private endpoints for both\nDFS\nand\nblob\nservices on the Azure Storage account hosting the model. 
This ensures the ABFS driver can authenticate and perform read and write operations correctly.\nIn the\nAzure Portal\n, go to the\nAzure Storage\naccount hosting the model.\nUnder\nNetworking > Private Endpoint Connections\n, create two private endpoints with the following sub-resources:\nTarget sub-resource:\nBlob\nTarget sub-resource:\nDFS\nVerify the endpoints resolve correctly within the Databricks workspace by executing the following shell commands in a notebook.\n%sh\r\nnslookup storage_account_name.blob.core.windows.net\r\nnslookup storage_account_name.dfs.core.windows.net\nFor more information, refer to Microsoft’s\nUse private endpoints for Azure Storage\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-storage-throttling.json b/scraped_kb_articles/azure-storage-throttling.json new file mode 100644 index 0000000000000000000000000000000000000000..1a0a2fdd70ec3ef17ac7f34072bced26a81765a7 --- /dev/null +++ b/scraped_kb_articles/azure-storage-throttling.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/azure-storage-throttling", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen accessing data stored on Azure Data Lake Storage (ADLS) or Windows Azure Storage Blobs (WASB), requests start timing out. You may see an error message indicating that storage is being accessed at too high a rate.\nFiles and folders are being created at too high a rate\nCause\nAzure storage subscriptions have a limit on how many files and folders can be accessed over time. If too many requests are made in a given time frame, your account will be subject to throttling in order to keep the requests under the subscription limit.\nSolution\nTo resolve this issue, you can either increase the storage limits on your Azure subscription or optimize your Spark code to reduce the number of files created."
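The arithmetic behind "reduce the number of files created" can be sketched: the same data volume written as fewer, larger files issues far fewer create requests against the storage account. The row counts below are illustrative assumptions.

```python
# Sketch (illustrative numbers): file-create operations needed to land a
# fixed number of rows, for small vs. large output files. Fewer creates
# keeps the request rate under the subscription's throttling limit.

def create_requests(total_rows, rows_per_file):
    """Number of file-create operations needed to write total_rows."""
    return -(-total_rows // rows_per_file)  # ceiling division

many_small = create_requests(10_000_000, 1_000)    # e.g. heavily partitioned writes
few_large = create_requests(10_000_000, 500_000)   # e.g. after consolidating output
print(many_small, few_large)  # 10000 20
```

In Spark code this usually translates to calling `coalesce()` or `repartition()` on the DataFrame before writing, so each task emits one larger file instead of many small ones.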
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-synapse-analytics-formerly-sql-data-warehouse-federated-queries-throwing-invalid-object-name-dbodate-error-.json b/scraped_kb_articles/azure-synapse-analytics-formerly-sql-data-warehouse-federated-queries-throwing-invalid-object-name-dbodate-error-.json new file mode 100644 index 0000000000000000000000000000000000000000..7fdeb935412fafc2915b2c819b5b037365770191 --- /dev/null +++ b/scraped_kb_articles/azure-synapse-analytics-formerly-sql-data-warehouse-federated-queries-throwing-invalid-object-name-dbodate-error-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/azure-synapse-analytics-formerly-sql-data-warehouse-federated-queries-throwing-invalid-object-name-dbodate-error-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the Databricks Lakehouse Federation feature to fetch data stored in Azure Synapse Analytics (formerly SQL Data Warehouse) – specifically when running a select query against tables under the schema – you receive an error message stating\nInvalid object name 'dbo.date\n'\neven though the table metadata is visible via the Catalog UI.\nCause\nThis issue arises when the collation for the database is set to\n“\nJapanese_BIN2”\nin the dedicated SQL pool (formerly SQL DW). In this case, Azure Synapse Analytics treats the query pushdown request as case-sensitive. 
Consequently, if the original table name in the source has a combination of uppercase and lowercase letters, the table name is loaded as lowercase in the Lakehouse Federation.\nSolution\nSet the collation for the database in the dedicated SQL pool to the default,\n\"\nSQL_Latin1_General_CP1_CI_AS\n\"\n.\nIf changing the collation is not feasible, ensure that all table names in the source database are in lowercase to match the case used in the Lakehouse Federation.\nFor more information, please review query federation limitations in the\nWhat is Lakehouse Federation?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-throttling.json b/scraped_kb_articles/azure-throttling.json new file mode 100644 index 0000000000000000000000000000000000000000..a74992c8d361de8cd4638c8c82d5f6e891761251 --- /dev/null +++ b/scraped_kb_articles/azure-throttling.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/azure-throttling", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you run a job that involves creating files in Azure Data Lake Storage (ADLS), either Gen1 or Gen2, the following exception occurs:\nCaused by: java.io.IOException: CREATE failed with error 0x83090c25 (Files and folders are being created at too high a rate). [745c5836-264e-470c-9c90-c605f1c100f5] failed with error 0x83090c25 (Files and folders are being created at too high a rate). 
[2019-04-12T10:06:43.1117197-07:00] [ServerRequestId:745c5836-264e-470c-9c90-c605f1c100f5]\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getRemoteException(ADLStoreClient.java:1191)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1154)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.createFile(ADLStoreClient.java:281)\r\nat com.databricks.adl.AdlFileSystem.create(AdlFileSystem.java:348)\r\nat com.databricks.spark.metrics.FileSystemWithMetrics.create(FileSystemWithMetrics.scala:280)\r\nat com.databricks.backend.daemon.data.client.DatabricksFileSystemV2$$anonfun$create$1$$anonfun$apply$10$$anonfun$apply$11.apply(DatabricksFileSystemV2.scala:483)\nCause\nEach ADLS subscription level has a limit on the number of files that can be created per unit of time, although the limits may differ depending on whether you are using ADLS Gen1 or Gen2. When the limit is exceeded, file creation is throttled, and the job fails.\nPotential causes for this error include:\nYour application creates a large number of small files.\nExternal applications create a large number of files.\nThe current limit for the subscription is too low.\nSolution\nIf your application or an external application is generating a large number of files, then you need to optimize the application. If the limit on your current subscription is not appropriate for your use case, then contact the Microsoft Azure Team for assistance." 
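Until the application is optimized or the subscription limit raised, an application-side mitigation for the throttling error above is to retry with exponential backoff. This is a generic sketch, not a Databricks or ADLS API: `create_file` stands in for any write call that can raise the `0x83090c25` "too high a rate" error.

```python
import time

# Sketch: retry a throttled file-create with exponential backoff. The
# caller passes any zero-argument callable that raises IOError when the
# storage account throttles the request.

def with_backoff(create_file, max_attempts=5, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return create_file()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # still throttled after all retries
            time.sleep(base_delay * 2 ** attempt)  # 10 ms, 20 ms, 40 ms, ...
```

Backoff only smooths out short throttling bursts; it does not fix a workload that persistently creates files faster than the subscription allows, which is why the article's primary advice is to reduce the file count.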
+} \ No newline at end of file diff --git a/scraped_kb_articles/azure-vnet-gen1-issue.json b/scraped_kb_articles/azure-vnet-gen1-issue.json new file mode 100644 index 0000000000000000000000000000000000000000..b3027d52afb9f4b1ced310406d472966e0f24dd0 --- /dev/null +++ b/scraped_kb_articles/azure-vnet-gen1-issue.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/azure-vnet-gen1-issue", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAccess to Azure Data Lake Storage Gen1 (ADLS Gen1) fails with\nADLException: Error getting info for file \nwhen the following network configuration is in place:\nAzure Databricks workspace is deployed in your own virtual network (uses VNet injection).\nTraffic is allowed via Azure Data Lake Storage credential passthrough.\nADLS Gen1 storage firewall is enabled.\nAzure Active Directory (Azure AD) service endpoint is enabled for the Azure Databricks workspace’s virtual network.\nCause\nAzure Databricks uses a control plane located in its own virtual network, and the control plane is responsible for obtaining a token from Azure AD. ADLS credential passthrough uses the control plane to obtain Azure AD tokens to authenticate the interactive user with ADLS Gen1.\nWhen you deploy your Databricks workspace in your own virtual network (using VNet injection), Azure Databricks clusters are created in your own virtual network. For increased security, you can restrict access to the ADLS Gen 1 account by configuring the ADLS Gen1 firewall to allow only requests from your own virtual network, by implementing service endpoints to Azure AD.\nHowever, ADLS credential passthrough fails in this case. 
The reason is that when ADLS Gen1 checks for the virtual network where the token was created, it finds the network to be the Azure Databricks control plane and not the customer-provided virtual network where the original passthrough call was made.\nSolution\nTo use ADLS credential passthrough with a service endpoint, storage firewall, and ADLS Gen1, enable\nAllow access to Azure services\nin the\nfirewall settings\n.\nIf you have security concerns about enabling this setting in the firewall, you can upgrade to ADLS Gen2. ADLS Gen2 works with the network configuration described above.\nFor more information, see:\nDeploy Azure Databricks in your Azure virtual network (VNet injection)\nAccess Azure Data Lake Storage using Azure Active Directory credential passthrough" +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-vnet-jobs-not-progressing.json b/scraped_kb_articles/azure-vnet-jobs-not-progressing.json new file mode 100644 index 0000000000000000000000000000000000000000..46393473761574f9f0b58a88670d47b8428cbae9 --- /dev/null +++ b/scraped_kb_articles/azure-vnet-jobs-not-progressing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/azure-vnet-jobs-not-progressing", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nJobs fail to run on any cluster in the workspace.\nCause\nThis can happen if you have changed the VNet of an existing workspace. 
Changing the VNet of an existing Azure Databricks workspace is not supported.\nReview\nDeploy Azure Databricks in your Azure virtual network (VNet injection)\nfor more details.\nSolution\nOpen the cluster driver logs in the Azure Databricks UI.\nSearch for the following WARN messages:\n19/11/19 16:50:29 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\n19/11/19 16:50:44 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\n19/11/19 16:50:59 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\nIf this error is present, it is likely that the VNet of the Azure Databricks workspace was changed.\nRevert the change to restore the original VNet configuration that was used when the Azure Databricks workspace was created.\nRestart the running cluster.\nResubmit your jobs.\nVerify the jobs are getting resources." +} \ No newline at end of file diff --git a/scraped_kb_articles/azure-vnet-single-ip.json b/scraped_kb_articles/azure-vnet-single-ip.json new file mode 100644 index 0000000000000000000000000000000000000000..4ccaff8d95acb2777d561329bcc3c0a7000f89be --- /dev/null +++ b/scraped_kb_articles/azure-vnet-single-ip.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/azure-vnet-single-ip", + "title": "Título do Artigo Desconhecido", + "content": "You can use an Azure Firewall to create a\nVNet-injected workspace\nin which all clusters have a single IP outbound address. The single IP address can be used as an additional security layer with other Azure services and applications that allow access based on specific IP addresses.\n1. Set up an Azure Databricks Workspace in your own virtual network.\n2. 
Set up a firewall within the virtual network. See\nCreate an NVA\n. When you create the firewall, you should:\nNote both the private and public IP addresses for the firewall for later use.\nCreate a network rule\nfor the public subnet to forward all traffic to the internet:\nName: any arbitrary name\nPriority: 100\nProtocol: Any\nSource Addresses: IP range for the public subnet in the virtual network that you created\nDestination Addresses: 0.0.0.0/0\nDestination Ports: *\n3. Create a Custom Route Table and associate it with the public subnet.\na. Add custom routes, also known as user-defined routes (\nUDR\n) for the following services. Specify the\nAzure Databricks region addresses\nfor your region. For\nNext hop type\n, enter\nInternet\n, as shown in\ncreating a route table\n.\nControl Plane NAT VIP\nWebapp\nMetastore\nArtifact Blob Storage\nLogs Blob Storage\nb. Add a custom route for the firewall with the following values:\nAddress prefix: 0.0.0.0/0\nNext hop type: Virtual appliance\nNext hop address: The private IP address for the firewall.\nc. Associate the route table with the public subnet.\n4. Validate the setup\nCreate a cluster in the Azure Databricks workspace.\nNext, query blob storage to your own paths or run\n%fs ls\nin a cell.\nIf it fails, confirm that the route table has all required UDRs (including Service Endpoint instead of the UDR for Blob Storage).\nFor more information, see\nRoute Azure Databricks traffic using a virtual appliance or firewall\n."
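The route-table design in steps 3a-3b relies on longest-prefix matching: the narrow per-service routes (next hop Internet) win over the catch-all 0.0.0.0/0 route to the firewall. The sketch below models that selection rule; the /32 prefix is an illustrative placeholder, not an actual Databricks region address.

```python
import ipaddress

# Sketch of user-defined route (UDR) selection: the most specific
# (longest-prefix) matching route wins, so narrow control-plane routes go
# straight to the Internet while all other traffic is steered to the
# firewall appliance. Prefixes are illustrative.
routes = [
    ("0.0.0.0/0", "VirtualAppliance"),   # catch-all -> firewall private IP
    ("52.27.216.188/32", "Internet"),    # e.g. a control-plane NAT VIP route
]

def next_hop(dest_ip, routes):
    """Pick the next hop of the longest-prefix route matching dest_ip."""
    matches = [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes
               if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(prefix)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("52.27.216.188", routes))  # Internet (specific route wins)
print(next_hop("93.184.216.34", routes))  # VirtualAppliance (catch-all)
```

This is why all cluster egress except the listed Databricks endpoints leaves through the firewall's single public IP.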
+} \ No newline at end of file diff --git a/scraped_kb_articles/backfill-delta-table-cols.json b/scraped_kb_articles/backfill-delta-table-cols.json new file mode 100644 index 0000000000000000000000000000000000000000..228e64a5f4a3a79c9532f0075c514deaf27459f2 --- /dev/null +++ b/scraped_kb_articles/backfill-delta-table-cols.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/backfill-delta-table-cols", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file.\nSolution\nIn this example, there is a\ncustomers\ntable, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format.\nCreate a\nDataFrame\nfrom the Parquet file using an Apache Spark API statement:\n%python\r\n\r\nupdatesDf = spark.read.parquet(\"/path/to/raw-file\")\nView the contents of the\nupdatesDf DataFrame\n:\n%python\r\n\r\ndisplay(updatesDf)\nCreate a table from the\nupdatesDf DataFrame\n. In this example, it is named\nupdates\n.\n%python\r\n\r\nupdatesDf.createOrReplaceTempView(\"updates\")\nCheck the contents of the updates table, and compare it to the contents of\ncustomers\n:\n%python\r\n\r\ndisplay(customers)\nUse the\nMERGE INTO\nstatement to merge the data from the\nupdates\ntable into the original\ncustomers\ntable.\n%sql\r\n\r\nMERGE INTO customers\r\nUSING updates\r\nON customers.customerId = updates.customerId\r\nWHEN MATCHED THEN\r\n  UPDATE SET address = updates.address\r\nWHEN NOT MATCHED\r\n  THEN INSERT (customerId, address) VALUES (updates.customerId, updates.address)\nHere,\ncustomers\nis the original Delta table that has an\naddress\ncolumn with missing values.\nupdates\nis the table created from the\nDataFrame updatesDf\n, which is created by reading data from the raw file.
The\naddress\ncolumn of the original Delta table is populated with the values from\nupdates\n, overwriting any existing values in the\naddress\ncolumn.\nIf\nupdates\ncontains customers that are not already in the\ncustomers table\n, then the command adds these new customer records.\nFor more examples of using\nMERGE INTO\n, see Merge Into (Delta Lake) (\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/bad-request-error-when-creating-a-table-from-a-shared-catalog.json b/scraped_kb_articles/bad-request-error-when-creating-a-table-from-a-shared-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..5c6045fbc66878c708deee84d4616fb08c791784 --- /dev/null +++ b/scraped_kb_articles/bad-request-error-when-creating-a-table-from-a-shared-catalog.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/bad-request-error-when-creating-a-table-from-a-shared-catalog", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to create a table from a Delta Sharing catalog in your Databricks workspace and you get an error message.\nExample error\nUncheckedExecutionException: io.delta.sharing.spark.util.UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 400 Bad Request\nExample error details\n\"error_code\" : \"BAD_REQUEST\", \"message\" : \"Failed request to sharing server\\nEndpoint: https://.databricks.com:443/api/2.0/delta-sharing/metastores//shares//schemas//tables/menu/metadata\\nMethod: GET\\nHTTP Code: 400\\nStatus Line: 400\\nBody: { \\\"error_code\\\" : \\\"BAD_REQUEST\\\", \\\"message\\\" : \\\"\\\\nTable property\\\\ndelta.enableDeletionVectors\\\\nis found in table version: 1.\\\\nHere are a couple options to proceed:\\\\n 1. Use DBR version 14.1(14.2 for CDF and streaming) or higher or delta-sharing-spark with version 3.1 or higher and set option (\\\\\\\"responseFormat\\\\\\\", \\\\\\\"delta\\\\\\\") to query the table.\\\\n 2. 
Contact your provider to ensure the table is shared with full history.\\\\n[Trace Id: 58bd2d3f99bdcd2f2a302f977074]\\\"}\\n[Trace Id: 58bd2d3f99bdcd2f2a302f977074]\\n\"\nThis issue typically arises in environments utilizing Delta Sharing and a mixture of Databricks Runtime versions.\nCause\nThe table being accessed has the\nenableDeletionVectors\nproperty enabled, which is not supported by the Databricks Runtime version you are using. The server returns a\n400 Bad Request\nerror.\nFor more information, review the\nWhat are deletion vectors?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nIf you are using Delta Sharing, you should upgrade your clients to a supported Databricks Runtime (recommended) or disable the\nenableDeletionVectors\nproperty on the source Delta table.\nUpgrade Databricks Runtime\nInfo\nYou should use Databricks Runtime 14.3 LTS or above if you want to use CDF (Change Data Feed) or streaming.\nStop your cluster.\nChange the Databricks Runtime to 14.1 or above.\nRestart the cluster.\nCreate the table.\nThis approach ensures compatibility with the\nenableDeletionVectors\nproperty.\nDisable enableDeletionVectors\nIf you cannot upgrade your Databricks Runtime version, you can disable the\nenableDeletionVectors\nproperty on the source Delta table. This makes the source table readable by Delta Lake clients that do not support deletion vectors.\nEnsure that you have\nMODIFY\npermissions on the source Delta table.\nAccess the table from a cluster running Databricks Runtime 14.1 or above.\nFollow the directions in the\nDrop Delta table features\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to disable the\nenableDeletionVectors\nproperty.\nRun\nVACUUM\nto delete old files and ensure no concurrent write operations are running.\nAccess the table from the client." 
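Before sharing a table, a provider can check its properties for features that older Delta Sharing clients cannot read. The helper below is a hypothetical sketch: the property map mirrors what `SHOW TBLPROPERTIES` returns, and `delta.enableDeletionVectors` is the real property name from the error above, but `sharing_blockers` itself is not a Databricks API.

```python
# Hypothetical pre-flight check: flag table properties that break older
# Delta Sharing clients. Only the property key is taken from the article;
# the function and the check set are illustrative.

UNSUPPORTED_FOR_OLD_CLIENTS = {"delta.enableDeletionVectors"}

def sharing_blockers(tbl_properties):
    """Return the enabled properties that older sharing clients reject."""
    return sorted(key for key, value in tbl_properties.items()
                  if key in UNSUPPORTED_FOR_OLD_CLIENTS and value == "true")

props = {"delta.enableDeletionVectors": "true", "delta.minReaderVersion": "3"}
print(sharing_blockers(props))  # ['delta.enableDeletionVectors']
```

An empty result suggests recipients on older runtimes can read the share; a non-empty one points at the upgrade-or-drop-feature choice the article describes.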
+} \ No newline at end of file diff --git a/scraped_kb_articles/bchashjoin-exceeds-bcjointhreshold-oom.json b/scraped_kb_articles/bchashjoin-exceeds-bcjointhreshold-oom.json new file mode 100644 index 0000000000000000000000000000000000000000..e68a522d46047ba3d1f0662a8eafb8534ba76309 --- /dev/null +++ b/scraped_kb_articles/bchashjoin-exceeds-bcjointhreshold-oom.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/bchashjoin-exceeds-bcjointhreshold-oom", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table.\nDespite the total size exceeding the limit set by\nspark.sql.autoBroadcastJoinThreshold\n,\nBroadcastHashJoin\nis used and Apache Spark returns an\nOutOfMemorySparkException\nerror.\norg.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=1073741824. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1\nCause\nThis is due to a limitation with Spark’s size estimator.\nIf the estimated size of one of the DataFrames is less than the\nautoBroadcastJoinThreshold\n, Spark may use\nBroadcastHashJoin\nto perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out of memory error.\nSolution\nThere are three different ways to mitigate this issue.\nUse\nANALYZE TABLE\n(\nAWS\n|\nAzure\n) to collect details and compute statistics about the DataFrames before attempting a join.\nCache the table (\nAWS\n|\nAzure\n) you are broadcasting.\nRun\nEXPLAIN\non your join command to return the physical plan.\n%sql\r\nEXPLAIN <your join query>\nReview the physical plan. If the broadcast join returns\nBuildLeft\n, cache the left side table.
If the broadcast join returns\nBuildRight\n, cache the right side table.\nIn Databricks Runtime 7.0 and above, set the join type to\nSortMergeJoin\nwith join hints enabled." +} \ No newline at end of file diff --git a/scraped_kb_articles/behavioral-changes-for-the-char-data-type-on-serverless.json b/scraped_kb_articles/behavioral-changes-for-the-char-data-type-on-serverless.json new file mode 100644 index 0000000000000000000000000000000000000000..418b4c86394b6710b2e5fbe2e13ad07d2c23784f --- /dev/null +++ b/scraped_kb_articles/behavioral-changes-for-the-char-data-type-on-serverless.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/behavioral-changes-for-the-char-data-type-on-serverless", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a column in a Delta table defined as\nCHAR(10)\n, but the row data contains fewer than 10 characters. If you try to retrieve a row with fewer than 10 characters, no data is returned. However, if you use the\nupper\nor\ntrim\nfunctions and try to retrieve a row, you get the correct results.\nExample\nThis example code creates a table with\nCHAR(10)\nand inserts a record with value “ABCDEF”. When reading the same value back, it returns zero records.\n%sql\r\n\r\ncreate or replace table (id DECIMAL(18,0),source CHAR(10), updated CHAR(1)) USING delta;\r\ninsert into values(1234567890, 'ABCDEF','Y');\r\nselect * from where source = 'ABCDEF'\nCause\nCHAR\ntype fields that have less than the declared length are padded with spaces on read to complete the missing length. For example, a row in a\nCHAR(10)\ncolumn with the value “ABCDEF” is padded as “ABCDEF    ” (six letters and four spaces for a total of 10 characters) when a read is performed.\nInfo\nThe\nCHAR\npad-on-read behavior change was introduced in Serverless 2024.15.
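The pad-on-read semantics described in the Cause can be modeled in plain Python: `str.ljust` plays the role of the space padding, and the comparisons mirror what padded or trimmed SQL predicates would match.

```python
# Model of CHAR(10) pad-on-read semantics (a sketch, not Databricks
# internals): the stored value comes back space-padded to the declared
# length, so an unpadded literal no longer compares equal.

DECLARED_LEN = 10

def read_char(stored):
    """What reading a CHAR(10) column returns for a shorter stored value."""
    return stored.ljust(DECLARED_LEN)

value = read_char("ABCDEF")                   # 'ABCDEF    '
print(value == "ABCDEF")                      # False: why the query is empty
print(value == "ABCDEF".ljust(DECLARED_LEN))  # True: padded predicate matches
print(value.rstrip() == "ABCDEF")             # True: trim() also matches
```

This is the same reason the article's `trim`-based query succeeds while the bare `= 'ABCDEF'` predicate returns zero records.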
For more information, review the\nServerless 2024.15 release notes\n(\nAWS\n|\nAzure\n|\nGCP\n).\nSolution\nAs a best practice you should pad your reads with spaces, so they match the declared length of the\nCHAR\nfield.\nIf you do not want to pad your reads, you can set\nspark.conf.set(\"spark.sql.legacy.charVarcharAsString\", \"true\")\nat the beginning of your notebook. This configures a\nCHAR\n/\nVARCHAR\nfield to behave as a string, so you can query just the values without padded spaces.\nExample\n%python\r\n\r\nspark.conf.set(\"spark.sql.legacy.charVarcharAsString\", \"true\")\n%sql\r\n\r\nselect * from  where source = 'ABCDEF';" +} \ No newline at end of file diff --git a/scraped_kb_articles/broadcast-join-hash-not-being-used-despite-hints.json b/scraped_kb_articles/broadcast-join-hash-not-being-used-despite-hints.json new file mode 100644 index 0000000000000000000000000000000000000000..e8f47a5d1fd18338888edbed05aa26610a2a2dd6 --- /dev/null +++ b/scraped_kb_articles/broadcast-join-hash-not-being-used-despite-hints.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/broadcast-join-hash-not-being-used-despite-hints", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen performing join transformations in Apache Spark, you notice the expected broadcast hash join is not being used, although you provide broadcast join hints.\nThe issue occurs even when one of the datasets qualifies for broadcast exchange. Additionally, you have tried Adaptive Query Execution (AQE) join optimizations, but they do not resolve the issue.\nCause\nThe table statistics are outdated. Spark estimates a larger data size than the actual size.\nAlternatively, you’re using a full outer join. 
Spark does not support broadcast joins with this join type.\nNote\nFor right outer joins only the left-side table can be broadcast, and for left outer joins only the right-side table can be broadcast.\nIf the transformation involves a left outer join, and the right DataFrame is not small enough to fit into memory, Spark resorts to a sort merge join. AQE skew optimization will also resort to a sort merge join in this situation.\nSolution\nUse one or more of the following options in combination to resolve the issue.\nRefresh your table’s statistics using the following command. Refreshing also helps Spark make accurate decisions during query optimization.\n%sql analyze table  compute statistics\nIf you are using a full outer join, use an inner join, left outer join, or right outer join instead depending on your use case. When possible, structure the query such that the smaller table is on the non-outer side to allow for broadcasting.\nIncrease the size of the broadcast threshold if you have enough compute memory to support broadcasting the data (the default is 10 MB). 
Use the following property.\n%sql\r\nset spark.sql.autoBroadcastJoinThreshold = ;\r\nset spark.databricks.adaptive.autoBroadcastJoinThreshold = ;" +} \ No newline at end of file diff --git a/scraped_kb_articles/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster.json b/scraped_kb_articles/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..444ee09977c82b45961975eb69a622be212bfe5e --- /dev/null +++ b/scraped_kb_articles/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/broadcast_variable_not_loaded-or-jvm_attribute_not_supported-errors-when-using-broadcast-variables-in-a-shared-access-mode-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re trying to broadcast a variable on a shared access mode cluster and receive error messages such as\nBROADCAST_VARIABLE_NOT_LOADED\nor\nJVM_ATTRIBUTE_NOT_SUPPORTED\n.\nCause\nDatabricks shared access mode clusters do not support broadcast variables due to their enhanced isolation architecture. Trying to use broadcast variables will lead to the error\nBROADCAST_VARIABLE_NOT_LOADED\n.\nIf you are using shared clusters in Databricks Runtime 14.0 and above, you see\nJVM_ATTRIBUTE_NOT_SUPPORTED\nin PySpark or\nvalue sparkContext is not a member of org.apache.spark.sql.SparkSession\nin Scala.\nSolution\nIf you need to use broadcast variables, Databricks recommends using single-user clusters for such workloads. 
This will allow you to bypass the isolation and enable the use of broadcast variables.\nIf you prefer to continue using a shared cluster, pass the variables into functions as a state parameter instead of using broadcast variables.\nFor more information on shared access mode limitations, please refer to the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/bucketing.json b/scraped_kb_articles/bucketing.json new file mode 100644 index 0000000000000000000000000000000000000000..c399842118ace07ae3b0162dfa80eed1e9d26337 --- /dev/null +++ b/scraped_kb_articles/bucketing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/bucketing", + "title": "Título do Artigo Desconhecido", + "content": "Bucketing is an optimization technique in Apache Spark SQL. Data is allocated among a specified number of buckets, according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data prior to downstream operations such as table joins. The tradeoff is the initial overhead due to shuffling and sorting, but for certain data transformations, this technique can improve performance by avoiding later shuffling and sorting.\nThis technique is useful for dimension tables, which are frequently used tables containing primary keys. It is also useful when there are frequent join operations involving large and small tables.\nThe example notebook below shows the differences in physical plans when performing joins of bucketed and unbucketed tables.\nBucketing example notebook\nOpen notebook in new tab\n." 
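The state-parameter pattern suggested above can be sketched in plain Python (the lookup data and function names here are hypothetical): pass the shared state into the function as an ordinary argument instead of broadcasting it.

```python
country_names = {"US": "United States", "DE": "Germany"}  # hypothetical lookup state

def enrich(row: dict, lookup: dict) -> dict:
    # the lookup arrives as a plain parameter, not a broadcast variable
    return {**row, "country_name": lookup.get(row["country"], "unknown")}

rows = [{"id": 1, "country": "US"}, {"id": 2, "country": "FR"}]
enriched = [enrich(r, country_names) for r in rows]
assert enriched[0]["country_name"] == "United States"
assert enriched[1]["country_name"] == "unknown"
```

On a shared cluster the same idea applies inside UDFs: capture the state through the function's arguments or closure rather than via `sparkContext.broadcast`.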
+} \ No newline at end of file diff --git a/scraped_kb_articles/bulk-update-workflow-permissions-for-a-group.json b/scraped_kb_articles/bulk-update-workflow-permissions-for-a-group.json new file mode 100644 index 0000000000000000000000000000000000000000..6d8ddf495cd56a5c1255bf77465f89623f323902 --- /dev/null +++ b/scraped_kb_articles/bulk-update-workflow-permissions-for-a-group.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/bulk-update-workflow-permissions-for-a-group", + "title": "Título do Artigo Desconhecido", + "content": "This article explains how you can use the Databricks Jobs API to grant a single group permission to access all the jobs in your workspace.\nInfo\nYou must be a workspace administrator to perform the steps detailed in this article.\nInstructions\nUse the following sample code to give a specific group of users permission for all the jobs in your workspace.\nInfo\nTo get your workspace URL, review\nWorkspace instance names, URLs, and IDs\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nGenerate a personal access token\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details on how to create a personal access token for use with the REST APIs.\nStart a cluster in your workspace and attach a notebook.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nUpdate the\n\n(\nAWS\n|\nAzure\n|\nGCP\n) value.\nUpdate the\n\nvalue with the name of the user group you are granting permissions to.\nRun the notebook cell.\n%python\r\n\r\nimport requests\r\nimport json\r\n\r\nshard_url = \"\"\r\naccess_token = \"\"\r\ngroup_name = \"\"\r\nheaders_auth = {\r\n 'Authorization': f'Bearer {access_token}'\r\n}\r\njob_list_url = shard_url+\"/api/2.1/jobs/list\"\r\njobs_list = requests.request(\"GET\", job_list_url, headers=headers_auth).json()\r\nfor job in jobs_list['jobs']: \r\n job_id = job['job_id']\r\n job_change_url = shard_url+\"/api/2.0/preview/permissions/jobs/\"+str(job_id)\r\n payload_pause_schedule = json.dumps({\r\n \"access_control_list\": 
[\r\n {\r\n \"group_name\":group_name,\r\n \"permission_level\": \"\"\r\n }\r\n ]\r\n })\r\n response = requests.request(\"PATCH\", job_change_url, headers=headers_auth, data=payload_pause_schedule)\r\n \r\nprint(\"Permissions Updated\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/calculate-number-of-cores.json b/scraped_kb_articles/calculate-number-of-cores.json new file mode 100644 index 0000000000000000000000000000000000000000..fa71e573b503a5ca9b5b5809559e11edaa5ff733 --- /dev/null +++ b/scraped_kb_articles/calculate-number-of-cores.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/calculate-number-of-cores", + "title": "Título do Artigo Desconhecido", + "content": "You can view the number of cores in a Databricks cluster in the Workspace UI using the\nMetrics\ntab on the cluster details page.\nDelete\nNote\nAzure Databricks cluster nodes must have a metrics service installed.\nIf the driver and executors are of the same node type, you can also determine the number of cores available in a cluster programmatically, using Scala utility code:\nUse\nsc.statusTracker.getExecutorInfos.length\nto get the total number of nodes. 
The result includes the driver node, so subtract 1.\nUse\njava.lang.Runtime.getRuntime.availableProcessors\nto get the number of cores per node.\nMultiply both results (subtracting 1 from the total number of nodes) to get the total number of cores available.\nScala example code:\njava.lang.Runtime.getRuntime.availableProcessors * (sc.statusTracker.getExecutorInfos.length - 1)" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-access-apache-sparkcontext-object-using-addpyfile.json b/scraped_kb_articles/cannot-access-apache-sparkcontext-object-using-addpyfile.json new file mode 100644 index 0000000000000000000000000000000000000000..98eaa7110e44428889a00a5950d75fee3966fa19 --- /dev/null +++ b/scraped_kb_articles/cannot-access-apache-sparkcontext-object-using-addpyfile.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cannot-access-apache-sparkcontext-object-using-addpyfile", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile using Databricks Connect with VSCode, you notice you can’t directly access the Apache SparkContext object using the addPyFile API to add simple files. The error stack trace indicates an attribute is not supported.\n[JVM_ATTRIBUTE_NOT_SUPPORTED] Directly accessing the underlying Spark driver JVM using the attribute 'sparkContext' is not supported on shared clusters.\nCause\nSparkContext is not available when using the addPyFile API with Databricks Connect on a cluster in Databricks Runtime 14.0 and above.\nSolution\nLeverage the addArtifact API to add simple files instead. Start by setting up your Spark Connect session.\nExample in Python\nAfter setting up your Spark Connect session, upload files. Various file types are included in the following code.\nfrom pyspark.sql import SparkSession\r\n\r\n# 1. Set up Spark Connect session\r\nspark = SparkSession.builder.remote(\"sc://localhost\").getOrCreate()\r\n\r\n# 2. 
Upload files\r\n\r\n# Arbitrary files\r\nspark.addArtifact(, file=True)\r\n\r\n# .py, .egg, .zip or .jar, automatically added into PYTHONPATH\r\nspark.addArtifact(, pyfile=True)\r\n\r\n# .zip, .jar, .tar.gz, .tgz, or .tar, automatically untarred in current working directory of UDF execution\r\nspark.addArtifact(, archive=True)\nExample in Scala\nAfter setting up your Spark Connect session, register a ClassFinder to monitor and upload the class files from the build output, then upload your JAR dependencies.\nimport org.apache.spark.sql.SparkSession\r\nimport org.apache.spark.sql.connect.client.REPLClassDirMonitor\r\n\r\n// 1. Set up Spark Connect session\r\nval spark = SparkSession.builder().remote(\"sc://localhost\").build()\r\n\r\n// 2. Register a ClassFinder to monitor and upload the classfiles from the build output.\r\nval classFinder = new REPLClassDirMonitor()\r\nspark.registerClassFinder(classFinder)\r\n\r\n// 3. Upload JAR dependencies\r\nspark.addArtifact()\nFor more information, refer to the\npyspark.sql.SparkSession.addArtifact\ndocumentation." 
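As a quick reference for the three flags shown in the Python example above, a hypothetical helper can map a file extension to the matching addArtifact keyword. The extension lists are taken from the comments in that example; note that .zip and .jar also appear in the archive list there, so this sketch simply defaults them to pyfile:

```python
ARCHIVE_EXTS = (".tar.gz", ".tgz", ".tar")     # untarred in the UDF working directory
PYFILE_EXTS = (".py", ".egg", ".zip", ".jar")  # added to PYTHONPATH

def pick_flag(path: str) -> dict:
    # hypothetical helper: choose the addArtifact keyword flag by extension
    if path.endswith(ARCHIVE_EXTS):
        return {"archive": True}
    if path.endswith(PYFILE_EXTS):
        return {"pyfile": True}
    return {"file": True}  # arbitrary files

assert pick_flag("deps.tar.gz") == {"archive": True}
assert pick_flag("helpers.py") == {"pyfile": True}
assert pick_flag("config.yaml") == {"file": True}
```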
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-access-databricks-secrets-when-using-a-no-isolation-shared-cluster.json b/scraped_kb_articles/cannot-access-databricks-secrets-when-using-a-no-isolation-shared-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..3f63d49d5f72eadacc71dca998ee9fc9dfa31708 --- /dev/null +++ b/scraped_kb_articles/cannot-access-databricks-secrets-when-using-a-no-isolation-shared-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/cannot-access-databricks-secrets-when-using-a-no-isolation-shared-cluster", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-alter-existing-column-to-delta-lake-generated-column.json b/scraped_kb_articles/cannot-alter-existing-column-to-delta-lake-generated-column.json new file mode 100644 index 0000000000000000000000000000000000000000..ffae893d45e1d0ddc8fa13159ded2f7d0d4208e0 --- /dev/null +++ b/scraped_kb_articles/cannot-alter-existing-column-to-delta-lake-generated-column.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/cannot-alter-existing-column-to-delta-lake-generated-column", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to use the\nALTER\ncommand to update an existing column to a generated column in a Delta table. You encounter the following error.\n[PARSE_SYNTAX_ERROR] Syntax error at or near 'GENERATED'. SQLSTATE: 42601\nCause\nYou must specify generated column expressions during table creation.\nOnce the table is created, you cannot alter a column to make it a generated column or change its generated expression. 
Generated columns are computed based on other columns in the table, and changing their definition requires rewriting the existing data.\nSolution\nRecreate the table with the column defined with the generated column definition.\n%sql\r\n   CREATE TABLE new_table (\r\n     -- other columns\r\n     date_utc DATE GENERATED ALWAYS AS (),\r\n     -- other columns\r\n   ) USING DELTA PARTITIONED BY (date_utc);\nFor more information, refer to the\nDelta Lake Generated columns\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-create-cluster-spark-conf-sparkdatabricksclusterprofile-is-not-allowed-when-choosing-an-access-mode.json b/scraped_kb_articles/cannot-create-cluster-spark-conf-sparkdatabricksclusterprofile-is-not-allowed-when-choosing-an-access-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..66536d544fac34f1aa74c292601cacbd95cc94f6 --- /dev/null +++ b/scraped_kb_articles/cannot-create-cluster-spark-conf-sparkdatabricksclusterprofile-is-not-allowed-when-choosing-an-access-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cannot-create-cluster-spark-conf-sparkdatabricksclusterprofile-is-not-allowed-when-choosing-an-access-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to create a single node cluster with the API or Terraform, the creation fails with the following error.\n“Cannot create cluster: spark conf: 'spark.databricks.cluster.profile' is not allowed when choosing an access mode”\nCause\nYou have the Apache Spark config\nspark.databricks.cluster.profile\nadded explicitly in the create new cluster API or cluster resource block.\nSolution\nUse the API flag\n“is_single_node”\ninstead of specifying a Spark config. 
When set to\ntrue\n, Databricks will automatically set single node-related\ncustom_tags\n,\nspark_conf\n, and\nnum_workers\n.\nFor more information, refer to the\nCreate new cluster\n(\nAWS\n|\nGCP\n|\nAzure\n) API documentation.\nThe following code is an example using the API documentation to create a single node Databricks Runtime 14.3 LTS compute resource.\n{\r\n  \"aws_attributes\": {\r\n    \"availability\": \"SPOT_WITH_FALLBACK\",\r\n    \"ebs_volume_count\": 0,\r\n    \"first_on_demand\": 1,\r\n    \"spot_bid_price_percent\": 100,\r\n    \"zone_id\": \"auto\"\r\n  },\r\n  \"cluster_name\": \"single-node-with-kind-cluster\",\r\n  \"is_single_node\": true,\r\n  \"kind\": \"CLASSIC_PREVIEW\",\r\n  \"node_type_id\": \"i3.xlarge\",\r\n  \"spark_version\": \"14.3.x-scala2.12\"\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-create-grants-terraform-error-when-using-multiple-databricks_grants-blocks-for-the-same-catalog.json b/scraped_kb_articles/cannot-create-grants-terraform-error-when-using-multiple-databricks_grants-blocks-for-the-same-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..8843938576d1fd3db6b7794e6de8d1771dd83af7 --- /dev/null +++ b/scraped_kb_articles/cannot-create-grants-terraform-error-when-using-multiple-databricks_grants-blocks-for-the-same-catalog.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/cannot-create-grants-terraform-error-when-using-multiple-databricks_grants-blocks-for-the-same-catalog", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen applying Terraform configurations using multiple\ndatabricks_grants\nresource blocks for assigning permissions to a single Databricks catalog, you encounter an error similar to the following.\n“Error: cannot create grants: permissions for catalog- are &{[{group-or-user-name [PERMISSION] [Principal]} ...]}, but have to be {[{group-or-user-name [PERMISSION] []} ...]}”\nExample code\nresource 
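The key point of the example above is what the request body must *not* contain. As a sketch (values illustrative), the body can be assembled and checked in Python before posting it to the Create new cluster endpoint:

```python
# Illustrative single node request body: is_single_node replaces the
# spark.databricks.cluster.profile Spark conf that triggers the error.
payload = {
    "cluster_name": "single-node-example",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "is_single_node": True,
}

assert "spark_conf" not in payload      # the rejected conf is omitted entirely
assert payload["is_single_node"] is True
```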
\"databricks_grants\" \"catalog_permissions_1\" {\r\n  catalog = \"example_catalog\"\r\n  grant {\r\n    principal  = \"group_a\"\r\n    privileges = [\"USE_CATALOG\"]\r\n  }\r\n}\r\n\r\nresource \"databricks_grants\" \"catalog_permissions_2\" {\r\n  catalog = \"example_catalog\"\r\n  grant {\r\n    principal  = \"group_b\"\r\n    privileges = [\"USE_CATALOG\"]\r\n  }\r\n}\nCause\nStarting with Databricks Terraform provider version 1.23.0 (released August 2023), changes to permission handling result in issues when multiple\ndatabricks_grants\nresource blocks are used for the same catalog. A fix was implemented in version 1.34.0 (released January 2024), introducing a new resource type:\ndatabricks_grant\n.\nFor more information, review the details in the Github issue\n[ISSUE] different databricks_grants inside different modules overwrite each other #2704\n.\nFor more information on the version 1.34.0 release, review the Github\nRelease v1.34.0 #3105\ndocumentation.\nSolution\nThere are two options available.\nThe recommended approach is to use dynamic blocks within a single\ndatabricks_grants\nresource to handle multiple principals and privileges simultaneously. For more information, refer to the\ndatabricks_grants Resource\ndocumentation.\nAlternatively, if you use Terraform version 1.34.0, replace the use of multiple\ndatabricks_grants\nresource blocks for the same catalog with the new resource type\ndatabricks_grant\n.\nExample code using\ndatabricks_grant\nresource \"databricks_grant\" \"catalog_permission_group_a\" {\r\n  catalog    = \"\"\r\n  principal  = \"group_a\"\r\n  privileges = [\"USE_CATALOG\"]\r\n}\r\n\r\nresource \"databricks_grant\" \"catalog_permission_group_b\" {\r\n  catalog    = \"\"\r\n  principal  = \"group_b\"\r\n  privileges = [\"USE_CATALOG\"]\r\n}\nFor more information, refer to the\ndatabricks_grant Resource\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-customize-apache-spark-config-in-databricks-sql-warehouse.json b/scraped_kb_articles/cannot-customize-apache-spark-config-in-databricks-sql-warehouse.json new file mode 100644 index 0000000000000000000000000000000000000000..86da03428da289603414b230ab9c83c17199ec8c --- /dev/null +++ b/scraped_kb_articles/cannot-customize-apache-spark-config-in-databricks-sql-warehouse.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/cannot-customize-apache-spark-config-in-databricks-sql-warehouse", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to set Apache Spark configuration properties in Databricks SQL warehouses like you do on standard clusters.\nCause\nDatabricks SQL is a managed service. You cannot modify the Spark configuration properties on a SQL warehouse. This is by design.\nYou can only configure a limited set of global Spark properties that apply to all SQL warehouses in your workspace.\nSolution\nReview the SQL warehouse data access configuration\nsupported properties\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for information on Spark properties you can apply to all SQL warehouses in your workspace.\nIf you have a unique need to set a specific Spark configuration property that is preventing you from using Databricks SQL please contact your Databricks representative." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-delete-a-user-from-a-databricks-account.json b/scraped_kb_articles/cannot-delete-a-user-from-a-databricks-account.json new file mode 100644 index 0000000000000000000000000000000000000000..21c379058aae24aec9dc8fdbe28876279d0703d1 --- /dev/null +++ b/scraped_kb_articles/cannot-delete-a-user-from-a-databricks-account.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/cannot-delete-a-user-from-a-databricks-account", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are an admin in your Databricks account, trying to delete a user (by email) from the account console. You receive the following error.\nEither missing permissions to delete or deleting own account.\nCause\nThe user you are trying to delete is the owner of the Databricks account. If the to-be-deleted user is not an account owner, you may be instead trying to delete your own account, which is not possible and results in this error.\nSolution\nFirst verify whether the to-be-deleted user is indeed the account owner and update the account ownership. If you are not sure who your Databricks account owner is, contact Databricks Support to retrieve this information.\nTo update account ownership, contact your Databricks account team or raise a support ticket to initiate ownership transfer. Once the ownership is updated, you can delete the user.\nIf neither of the causes pertains to you and you continue to encounter the error, contact Databricks Support to help you investigate further." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-delete-permissions-error-when-trying-to-manage-sql-warehouse-permissions-through-terraform.json b/scraped_kb_articles/cannot-delete-permissions-error-when-trying-to-manage-sql-warehouse-permissions-through-terraform.json new file mode 100644 index 0000000000000000000000000000000000000000..7716cd176238e27a00b4c2b2f723d8089e3cbacb --- /dev/null +++ b/scraped_kb_articles/cannot-delete-permissions-error-when-trying-to-manage-sql-warehouse-permissions-through-terraform.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/terraform/cannot-delete-permissions-error-when-trying-to-manage-sql-warehouse-permissions-through-terraform", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to manage SQL Warehouse permissions through Terraform, you encounter the following error.\n“Error: cannot delete permissions: PUT requests for warehouse with no existing owner must provide a new owner.”\nYour SQL warehouses were initially created without an owner through Terraform, and you’re trying to assign the\nIS_OWNER\nproperty to the deployment principal or a user during the deployment process.\nCause\nA change was introduced in version\nv1.14.0\nof the Databricks Terraform provider, which automatically adds the\nCAN_MANAGE\npermission on\ndatabricks_sql_endpoint\nfor the calling user.\nThis creates a conflict where the deployment principal or user is being assigned both\nIS_OWNER\nand\nCAN_MANAGE\npermissions, which leads to only the\nCAN_MANAGE\npermission being added. 
This behavior is due to how the\nPUT /api/2.0/permissions/sql/warehouses\nAPI handles permission assignments.\nSolution\nEnsure the user or principal being granted the\nIS_OWNER\npermission is not the same entity acting as the deployment principal used for the Terraform deployment.\nExample\nCheck that the\nservice_principal_name\nis pointing to a different service principal user than the one you are using for the Terraform apply operation.\nresource \"databricks_permissions\" \"dbx_genie_sql_warehouse\" {\r\nprovider = \r\nsql_endpoint_id = .dbx_genie_sql_warehouse.id\r\naccess_control {\r\nservice_principal_name = data...\r\npermission_level = \"IS_OWNER\"\r\n}\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-delete-unity-catalog-metastore-using-terraform.json b/scraped_kb_articles/cannot-delete-unity-catalog-metastore-using-terraform.json new file mode 100644 index 0000000000000000000000000000000000000000..2e4dd577b181f326e4f7f582135cf0b17d52694f --- /dev/null +++ b/scraped_kb_articles/cannot-delete-unity-catalog-metastore-using-terraform.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/cannot-delete-unity-catalog-metastore-using-terraform", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou cannot delete the Unity Catalog metastore using Terraform.\nCause\nThe default catalog is auto-created with a metastore. 
As a result, you cannot delete the metastore without first wiping the catalog.\nSolution\nSet\nforce_destroy = true\nin the\ndatabricks_metastore\nsection of the Terraform configuration to delete the metastore and the corresponding catalog.\nExample code:\nresource \"databricks_metastore\" \"this\" {\r\n  provider      = databricks.workspace\r\n  name          = \"primary\"\r\n  storage_root  = \"s3://${aws_s3_bucket.metastore.id}/metastore\"\r\n  owner         = var.unity_admin_group\nforce_destroy = true\n}\r\n\r\n\r\nresource \"databricks_metastore_data_access\" \"this\" {\r\n  provider     = databricks.workspace\r\n  metastore_id = databricks_metastore.this.id\r\n  name         = aws_iam_role.metastore_data_access.name\r\n  aws_iam_role {\r\n    role_arn = aws_iam_role.metastore_data_access.arn\r\n  }\r\n  is_default = true\r\n}\r\n\r\n\r\nresource \"databricks_metastore_assignment\" \"default_metastore\" {\r\n  provider             = databricks.workspace\r\n  for_each             = toset(var.databricks_workspace_ids)\r\n  workspace_id         = each.key\r\n  metastore_id         = databricks_metastore.this.id\r\n  default_catalog_name = \"hive_metastore\"\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-deploy-the-agent-model-with-a-compute-resource-assigned-to-a-group.json b/scraped_kb_articles/cannot-deploy-the-agent-model-with-a-compute-resource-assigned-to-a-group.json new file mode 100644 index 0000000000000000000000000000000000000000..36a8303f19b6e65595846c591502acc7f3654a87 --- /dev/null +++ b/scraped_kb_articles/cannot-deploy-the-agent-model-with-a-compute-resource-assigned-to-a-group.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cannot-deploy-the-agent-model-with-a-compute-resource-assigned-to-a-group", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re using a compute resource assigned to a group using the dedicated access mode to deploy agents. 
You’re using code such as that in the example notebook section of the\nUse Genie in multi-agent systems\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nWhen you try to run the\nagents.deploy()\ncode, you encounter the following error.\nThe creator of the model version cannot be authenticated. Please re-log the model with a valid user. Config: host=, auth_type=runtime File \nCause\nCompute resources assigned to groups cannot create or access model serving endpoints.\nFor more information, review the\nAssign compute resources to a group\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nContinue to use dedicated access mode, but assign the compute to a user or service principal to create and deploy the agent model." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-grow-bufferholder-exceeds-size.json b/scraped_kb_articles/cannot-grow-bufferholder-exceeds-size.json new file mode 100644 index 0000000000000000000000000000000000000000..11e892816fba1b7a0044f494590b5c58bd945c65 --- /dev/null +++ b/scraped_kb_articles/cannot-grow-bufferholder-exceeds-size.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/cannot-grow-bufferholder-exceeds-size", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Apache Spark job fails with an\nIllegalArgumentException: Cannot grow BufferHolder\nerror.\njava.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632\nCause\nBufferHolder\nhas a maximum size of 2147483632 bytes (approximately 2 GB).\nIf a column value exceeds this size, Spark returns the exception.\nThis can happen when using aggregates like\ncollect_list\n.\nThis example code generates duplicates in the column values which exceed the maximum size of\nBufferHolder\n. 
As a result, it returns an\nIllegalArgumentException: Cannot grow BufferHolder\nerror when run in a notebook.\n%scala\r\n\r\nimport org.apache.spark.sql.functions._\r\nspark.range(10000000).withColumn(\"id1\",lit(\"jkdhdbjasdshdjkqgdkdkasldksashjckabacbaskcbakshckjasbc$%^^&&&&&*jxcfdkwbfkjwdqndlkjqslkndskbndkjqbdjkbqwjkdbxnsa xckqjwbdxsabvnxbaskxqbhwdhqjskdjxbqsjdhqkjsdbkqsjdkjqdhkjqsabcxns ckqjdkqsbcxnsab ckjqwbdjckqscx ns csjhdjkqsdhjkqshdjsdhqksjdhxqkjshjkshdjkqsdhkjqsdhjqskxb kqscbxkjqsc\")).groupBy(\"id1\").\r\nagg(collect_list(\"id1\").alias(\"days\")).\r\nshow()\nSolution\nYou must ensure that column values do not exceed 2147483632 bytes. This may require you to adjust how you process data in your notebook.\nIn our example code, using collect_set instead of collect_list resolves the issue and allows the example to run to completion. This single change works because the example data set contains a large number of duplicate entries.\n%scala\r\n\r\nimport org.apache.spark.sql.functions._\r\nspark.range(10000000).withColumn(\"id1\",lit(\"jkdhdbjasdshdjkqgdkdkasldksashjckabacbaskcbakshckjasbc$%^^&&&&&*jxcfdkwbfkjwdqndlkjqslkndskbndkjqbdjkbqwjkdbxnsa xckqjwbdxsabvnxbaskxqbhwdhqjskdjxbqsjdhqkjsdbkqsjdkjqdhkjqsabcxns ckqjdkqsbcxnsab ckjqwbdjckqscx ns csjhdjkqsdhjkqshdjsdhqksjdhxqkjshjkshdjkqsdhkjqsdhjqskxb kqscbxkjqsc\")).groupBy(\"id1\").\r\nagg(collect_set(\"id1\").alias(\"days\")).\r\nshow()\nIf using\ncollect_set\ndoes not keep the size of the column below the\nBufferHolder\nlimit of 2147483632 bytes, the\nIllegalArgumentException: Cannot grow BufferHolder\nerror still occurs. In this case, we would have to split the list into multiple DataFrames and write it out as separate files." 
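The "split into multiple DataFrames" fallback mentioned above amounts to chunking the values under a byte budget. A plain-Python sketch of that chunking (names and the budget are illustrative; in practice each chunk would be aggregated and written out separately):

```python
def split_under_limit(values, max_bytes):
    # Greedily pack values into chunks whose UTF-8 size stays within max_bytes.
    chunks, current, size = [], [], 0
    for v in values:
        b = len(v.encode("utf-8"))
        if current and size + b > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += b
    if current:
        chunks.append(current)
    return chunks

parts = split_under_limit(["aa", "bb", "cc", "dd"], 5)
assert parts == [["aa", "bb"], ["cc", "dd"]]
assert sum(parts, []) == ["aa", "bb", "cc", "dd"]  # nothing lost
```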
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-import-egg-module.json b/scraped_kb_articles/cannot-import-egg-module.json new file mode 100644 index 0000000000000000000000000000000000000000..4dd25f1809b34a97010d75217ead70b746baac10 --- /dev/null +++ b/scraped_kb_articles/cannot-import-egg-module.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/cannot-import-egg-module", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou try to install an egg library to your cluster and it fails with a message that a module in the library cannot be imported.\nEven a simple import fails.\nimport sys\r\negg_path='/dbfs//.egg'\r\nsys.path.append(egg_path)\r\nimport shap_master\nCause\nThis error message occurs due to the way the library is packed.\nSolution\nIf the standard library import options do not work, you should use\neasy_install\nto install the library.\n%python\r\n\r\ndbutils.fs.put(\"//.sh\",\"\"\"\r\n#!/bin/bash\r\neasy_install-3.7 /dbfs//.egg\"\"\")\nDelete\nWarning\nThe version of\neasy_install\nmust match the version of Python on the cluster. You can determine the version of Python on your cluster by reviewing the release notes (\nAWS\n|\nAzure\n)." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-import-tabularprediction.json b/scraped_kb_articles/cannot-import-tabularprediction.json new file mode 100644 index 0000000000000000000000000000000000000000..c7df4f892d771da7c5516d0824e9f945747618a1 --- /dev/null +++ b/scraped_kb_articles/cannot-import-tabularprediction.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/cannot-import-tabularprediction", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to import\nTabularPrediction\nfrom AutoGluon, but are getting an error message.\nImportError: cannot import name 'TabularPrediction' from 'autogluon' (unknown location)\nThis happens when AutoGluon is installed via a notebook or as a cluster-installed library (\nAWS\n|\nAzure\n|\nGCP\n).\nYou can reproduce the error by running the import command in your notebook:\n%python\r\n\r\nimport autogluon as ag\r\nfrom autogluon import TabularPrediction as task\nCause\nThere is a namespace collision in AutoGluon v0.0.14.\nautogluon==0.0.14\ninstalls\ngluoncv>=0.5.0,<1.0\n. This results in\ngluoncv==0.9.0\ngetting installed, which creates the namespace collision.\nSolution\nThe namespace collision was resolved in AutoGluon v0.0.15. 
Upgrade to AutoGluon v0.0.15 to use\nTabularPrediction\n.\nSpecify\nautogluon==0.0.15\nwhen installing AutoGluon as a cluster-installed library from PyPI.\nYou can also install it via a notebook.\n%sh\r\npip install autogluon==0.0.15 autogluon.tabular \"mxnet<2.0.0\"\nAfter you have upgraded to AutoGluon v0.0.15, you can successfully import\nTabularPrediction\n.\n%python\r\n\r\nimport autogluon as ag\r\nfrom autogluon import TabularPrediction as task" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-import-timestamp-millis-unix-millis.json b/scraped_kb_articles/cannot-import-timestamp-millis-unix-millis.json new file mode 100644 index 0000000000000000000000000000000000000000..114114f3a0316e97afee0f3805227d501f31cde8 --- /dev/null +++ b/scraped_kb_articles/cannot-import-timestamp-millis-unix-millis.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/cannot-import-timestamp-millis-unix-millis", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to import\ntimestamp_millis\nor\nunix_millis\ninto a Scala notebook, but get an error message.\n%scala\r\n\r\nimport org.apache.spark.sql.functions.{timestamp_millis, unix_millis}\nerror: value timestamp_millis is not a member of object org.apache.spark.sql.functions\r\nimport org.apache.spark.sql.functions.{timestamp_millis, unix_millis}\nCause\nThe functions\ntimestamp_millis\nand\nunix_millis\nare not available in the Apache Spark\nDataFrame API\n.\nThese functions are specific to SQL and are\nincluded in Spark 3.1.1 and above\n.\nSolution\nYou need to use\nselectExpr()\nwith\ntimestamp_millis\nor\nunix_millis\nif you want to use either one of them with a DataFrame.\nselectExpr()\ntakes a set of SQL expressions and runs them.\nFor example, this sample code returns an error message when run.\n%scala\r\n\r\nimport sqlContext.implicits._\r\nval df = Seq(\r\n (1, \"First Value\"),\r\n (2, \"Second Value\")\r\n).toDF(\"int_column\", 
\"string_column\")\r\n\r\nimport org.apache.spark.sql.functions.{unix_millis}\r\nimport org.apache.spark.sql.functions.col\r\ndf.select(unix_millis(col(\"int_column\"))).show()\nerror: value unix_millis is not a member of object org.apache.spark.sql.functions\r\nimport org.apache.spark.sql.functions.{unix_millis}\nWhile this sample code, using\nselectExpr()\n, successfully returns timestamp values.\n%scala\r\n\r\nimport org.apache.spark.sql.functions._\r\nimport sqlContext.implicits._\r\nval ndf = Seq(\r\n (1, \"First Value\"),\r\n (2, \"Second Value\")\r\n).toDF(\"int_column\", \"string_column\")\r\n\r\ndisplay(ndf.selectExpr(\"timestamp_millis(int_column)\"))\nExample notebook\nReview the\nCannot import timestamp_millis or unix_millis example notebook\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-modify-spark-serializer.json b/scraped_kb_articles/cannot-modify-spark-serializer.json new file mode 100644 index 0000000000000000000000000000000000000000..ba3cf5e4d87670162a86c8acefa83b1bedee7576 --- /dev/null +++ b/scraped_kb_articles/cannot-modify-spark-serializer.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/cannot-modify-spark-serializer", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to\nSET\nthe value of a Spark config in a notebook and get a\nCannot modify the value of a Spark config\nerror.\nFor example:\n%sql\r\n\r\nSET spark.serializer=org.apache.spark.serializer.KryoSerializer\nError in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer;\nCause\nThe\nSET\ncommand does not work on\nSparkConf\nentries. This is by design in Spark 3.0 and above.\nSolution\nYou should remove\nSET\ncommands for\nSparkConf\nentries from your notebook.\nYou can enter\nSparkConf\nvalues at the cluster level by entering them in the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n) and restarting the cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-see-how-to-change-the-owner-of-a-dlt-pipeline-in-the-databricks-ui.json b/scraped_kb_articles/cannot-see-how-to-change-the-owner-of-a-dlt-pipeline-in-the-databricks-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..4e6e3a7341b9d654f0612a5b8819a0a33417def0 --- /dev/null +++ b/scraped_kb_articles/cannot-see-how-to-change-the-owner-of-a-dlt-pipeline-in-the-databricks-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/cannot-see-how-to-change-the-owner-of-a-dlt-pipeline-in-the-databricks-ui", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are required to change the DLT pipeline owner, but there are no options visible in the UI to complete this task.\nCause\nThe user initiating the ownership change must be an admin in both the metastore and the workspace. If you have only one of these admin privileges (for example, you are a metastore admin but not a workspace admin), you will not be able to see the option to change the owner of the pipeline.\nSolution\nIn order to change a DLT pipeline owner, ask a user with both metastore and workspace admin privileges to execute the following steps.\nGo to the workspace where this pipeline is located.\nIn the left side menu, select\nWorkflows > Delta Live Tables\nFilter the required pipeline and select it.\nClick the ellipsis icon.\nClick the\nPermissions\nbutton in the Pipelines UI to expand the pipeline permissions modal.\nClick the\nX\nnext to the current owner to clear this owner.\nInclude the new owner as\n“Is Owner”\nin the pipeline permissions modal and click\nSave\n.\nAfter saving, all pipeline assets defined in the pipeline will be owned by the new pipeline owner. All future updates will be run using the identity of the new owner." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-see-how-to-query-the-number-of-users-per-workspace-in-sql-analytics.json b/scraped_kb_articles/cannot-see-how-to-query-the-number-of-users-per-workspace-in-sql-analytics.json new file mode 100644 index 0000000000000000000000000000000000000000..cec2aca29cbca26bafc8642ae64466bb8b79b532 --- /dev/null +++ b/scraped_kb_articles/cannot-see-how-to-query-the-number-of-users-per-workspace-in-sql-analytics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/cannot-see-how-to-query-the-number-of-users-per-workspace-in-sql-analytics", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working in SQL Analytics, you want to query the number of users per workspace. In order to do that, you need to identify which system tables to query.\nThis requirement often arises when managing user access and monitoring workspace usage.\nCause\nThe\nsystem.access.audit\ntable contains data on the number of users per workspace.\nSolution\nQuery the\nsystem.access.audit\ntable with appropriate filters and groupings.\nOpen your Databricks SQL Analytics environment.\nUse the following SQL query to retrieve the number of users per workspace.\n%sql\r\n\r\nSELECT\r\nworkspace_id,\r\nCOUNT(DISTINCT user_identity.email) as user_count\r\nFROM\r\nsystem.access.audit\r\nWHERE\r\nservice_name = 'accounts'\r\nAND action_name = 'tokenLogin'\r\nAND request_params.user LIKE '%@%'\r\nGROUP BY workspace_id;\nInfo\nThe query assumes that the user email addresses have a valid format.\nThis query selects the\nworkspace_id\nand counts the distinct user emails from the\nsystem.access.audit\ntable where the\nservice_name\nis\n'accounts'\nand the\naction_name\nis '\ntokenLogin'\n. The\nrequest_params.user\nfilter ensures that only valid user entries are considered. 
Grouping by\nworkspace_id\nprovides the user count for each workspace.\nFor more information, refer to the\nMonitor account activity with system tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-see-ingested-data-loaded-from-an-external-orc-table.json b/scraped_kb_articles/cannot-see-ingested-data-loaded-from-an-external-orc-table.json new file mode 100644 index 0000000000000000000000000000000000000000..5a9908bb1a587448e07c5d32e5ab0e82c41b53f5 --- /dev/null +++ b/scraped_kb_articles/cannot-see-ingested-data-loaded-from-an-external-orc-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/cannot-see-ingested-data-loaded-from-an-external-orc-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou ingest data into a Delta table from an Apache ORC table (outside Databricks), but cannot see the data when querying the partition columns in the Delta table.\nCause\nFrom Apache Spark 2.4 onwards, the Hive interface configuration\nspark.sql.hive.convertMetastoreOrc\ndefault is set to\ntrue\n. In older Spark versions, the configuration is set to\nfalse\n.\nThis creates a situation where the Hive interface used at the time of ingesting the Delta table is different than the Hive interface used while reading the same Delta table. 
This leads to incorrect results while querying the Delta table.\nSolution\nEnsure you use the same Hive interface while ingesting and reading the Delta table.\nIf you ingest data using Spark versions older than 2.4, then also set the following configuration at the cluster level.\n* `spark.sql.hive.convertMetastoreOrc=false`" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-select-a-compute-policy-for-a-dlt-pipeline.json b/scraped_kb_articles/cannot-select-a-compute-policy-for-a-dlt-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..5cad30d7a33b78e919a40b7a5db76b018a67f6b1 --- /dev/null +++ b/scraped_kb_articles/cannot-select-a-compute-policy-for-a-dlt-pipeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cannot-select-a-compute-policy-for-a-dlt-pipeline", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are unable to view or select custom policies in the dropdown menu when attempting to apply a policy to a DLT (Delta Live Tables) pipeline.\nCause\nThis is usually caused by:\nYou do not have\nCan Use\npermissions on the policy configuration for the policy type you are trying to use.\nThe policy configuration limits the target compute type to a specific compute and you are trying to use a different compute type.\nSolution\nAsk your Databricks admin to review the compute policy configuration and verify that you have\nCan Use\npermissions on the policy or are part of a group that has\nCan Use\npermissions on the policy.\nReview the compute policy JSON definition and ensure the target compute type is allowed by the policy. 
If the policy is restricted to a compute type, the compute policy JSON definition has a\ncluster_type\nsection.\n\ncan be\nall-purpose\n,\ndlt\n, or\njob\n.\n\"cluster_type\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"\"\r\n  \t\t\t  }\nIf the\ncluster_type\nsection is not present in the compute policy JSON definition the policy can be used with all compute types.\nReview the\nCompute policy reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information on the supported attributes." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-select-a-databricks-runtime-version-when-using-a-delta-live-tables-pipeline.json b/scraped_kb_articles/cannot-select-a-databricks-runtime-version-when-using-a-delta-live-tables-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..b2292bfd130414843031bace853515330dbc0a68 --- /dev/null +++ b/scraped_kb_articles/cannot-select-a-databricks-runtime-version-when-using-a-delta-live-tables-pipeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/cannot-select-a-databricks-runtime-version-when-using-a-delta-live-tables-pipeline", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to select a specific Databricks Runtime version for use with your Delta Live Tables (DLT) pipeline, but you cannot find an option for it in the UI or the API.\nCause\nDelta Live Tables do not allow you to directly configure the Databricks Runtime version.\nDelta Live Tables clusters run on a custom version of the Databricks Runtime that is continually updated to include the latest features.\nSolution\nWhen creating a Delta Live Tables\npipeline\nyou can choose from one of two channels:\nCurrent\nchannel - This is the default choice. It is the current, stable DLT runtime version.\nPreview\nchannel - This allows you to test your pipeline with the newest version of the DLT runtime.\nInfo\nThe channel field is optional. 
If you do not enter a value, the DLT pipeline defaults to the current channel.\nDatabricks recommends using the current channel for production workloads.\nFor more information, please review the creating a new DLT pipeline documentation (\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-set-a-custom-pythonpath.json b/scraped_kb_articles/cannot-set-a-custom-pythonpath.json new file mode 100644 index 0000000000000000000000000000000000000000..1ebc007aefa8caed63d23d2d50db8ff0a8173339 --- /dev/null +++ b/scraped_kb_articles/cannot-set-a-custom-pythonpath.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cannot-set-a-custom-pythonpath", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou try to set a custom\nPYTHONPATH\nenvironment variable in a cluster-scoped init script, but the values are overridden at driver startup.\nCause\nSetting a custom\nPYTHONPATH\nin an init script does not work and is not supported.\nAdditionally, you cannot set a custom\nPYTHONPATH\nwhen using Databricks Container Services.\nSolution\nYou should not try to set a custom\nPYTHONPATH\n.\nIf you need to use custom Python libraries or modules, install the required files to pre-existing directories that are included in the cluster's\nPYTHONPATH\n.\nThis sample code lists all directories in the cluster's PYTHONPATH.\n%python\r\n\r\nimport sys\r\nprint(sys.path)" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-upload-csv-or-sh-file-types-in-workspace-ui.json b/scraped_kb_articles/cannot-upload-csv-or-sh-file-types-in-workspace-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..7c9faf4088a72bf58e574e7f0043c79823f1c235 --- /dev/null +++ b/scraped_kb_articles/cannot-upload-csv-or-sh-file-types-in-workspace-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/cannot-upload-csv-or-sh-file-types-in-workspace-ui", + 
"title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/cannot-view-table-serde-properties.json b/scraped_kb_articles/cannot-view-table-serde-properties.json new file mode 100644 index 0000000000000000000000000000000000000000..070926c1a6f45493884e41390cc6a0092dc2f199 --- /dev/null +++ b/scraped_kb_articles/cannot-view-table-serde-properties.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/cannot-view-table-serde-properties", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to view the SerDe properties on an Apache Hive table, but\nSHOW CREATE TABLE\njust returns the Apache Spark DDL. It does not show the SerDe properties.\nFor example, given this sample code:\n%sql\r\n\r\nSHOW CREATE TABLE \nYou get a result that does not show the SerDe properties:\nCause\nYou are using Databricks Runtime 7.3 LTS or later, which uses Spark 3.0 and above.\nThe usage of\nSHOW CREATE TABLE\nchanged with Spark 3.0.\nSolution\nTo view a table's SerDe properties in Spark 3.0 and above, you need to add the option\nAS SERDE\nat the end of the\nSHOW CREATE TABLE\ncommand.\nFor example, given this sample code:\nSHOW CREATE TABLE AS SERDE\nYou get a result that shows the table's SerDe properties:" +} \ No newline at end of file diff --git a/scraped_kb_articles/cant-add-members-to-external-groups-using-the-ui.json b/scraped_kb_articles/cant-add-members-to-external-groups-using-the-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..fe13d32203f832460476a043d393caa56b2ab650 --- /dev/null +++ b/scraped_kb_articles/cant-add-members-to-external-groups-using-the-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/cant-add-members-to-external-groups-using-the-ui", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git 
a/scraped_kb_articles/cant-edit-jobs-created-using-databricks-asset-bundles-dabs-using-the-ui.json b/scraped_kb_articles/cant-edit-jobs-created-using-databricks-asset-bundles-dabs-using-the-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..2b90bf64a274d2389246bff11af9eb5f66f5e62b --- /dev/null +++ b/scraped_kb_articles/cant-edit-jobs-created-using-databricks-asset-bundles-dabs-using-the-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/cant-edit-jobs-created-using-databricks-asset-bundles-dabs-using-the-ui", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou notice your jobs created using Databricks Asset Bundles (DABs) are not editable in the UI. You’re unable to perform actions like\nEdit trigger\n,\nPause\n, or\nDelete\n. The following image shows these actions grayed out in the UI.\nCause\nJobs managed by DABs are not editable in the UI for security reasons. DABs are an infrastructure-as-code (IaC) approach to managing projects, so the expectation is that editing in the UI is not necessary.\nUnder the hood, when deploying Databricks jobs via CLI using DABs, the\nedit_mode\nfield is overridden to be\nUI_LOCKED\n.\nSolution\nThe\nedit_mode\nfield can be changed. You can manually override it, or make job updates programmatically using the API or SDK.\nManually override the edit mode\nAccess the job using the UI and click the\nDisconnect from source\nbutton in the header. In the following image of the UI, the button is highlighted with a green rectangle.\nThen, in the popup that appears, click the red\nDisconnect\nbutton to confirm. This overrides the edit mode in the UI and allows changes. In the following image of the popup, the button is highlighted with a green rectangle.\nMake job updates programmatically using the API or SDK\nAfter deploying a job, update it using the API or SDK. 
Set the\nedit_mode\nfield to\n“EDITABLE”\nwithin the\n“new_settings”\nparameter.\nAPI\nYou can use the following example code. For details, refer to the\nUpdate job settings partially API\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n%sh\r\ncurl -X POST https:///api/2.1/jobs/update \\\\\r\n  -H \"Authorization: Bearer \" \\\\\r\n  -H \"Content-Type: application/json\" \\\\\r\n  -d '{\r\n    \"job_id\": ,\r\n    \"new_settings\": {\r\n      \"edit_mode\": \"EDITABLE\"\r\n    }\r\n  }'\nDatabricks SDK for Python\nYou can use the following example code. For details, refer to the “update” section of the\nw.jobs: Jobs\nDatabricks SDK for Python documentation.\n%python\r\n\r\nfrom databricks.sdk import WorkspaceClient\r\n\r\n# Initialize the client\r\nclient = WorkspaceClient(host='', token='')\r\n\r\n# Update the job settings\r\nclient.jobs.update(\r\n    job_id=,\r\n    new_settings={\r\n        'edit_mode': 'EDITABLE'\r\n    }\r\n)\nFor more information, review the\nDatabricks Asset Bundles resources\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cant-find-where-to-set-email-notifications-for-dab-jobs-in-the-ui.json b/scraped_kb_articles/cant-find-where-to-set-email-notifications-for-dab-jobs-in-the-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..84853ac18b431aab3d2e9c41cbdd068a45d9a3ab --- /dev/null +++ b/scraped_kb_articles/cant-find-where-to-set-email-notifications-for-dab-jobs-in-the-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/cant-find-where-to-set-email-notifications-for-dab-jobs-in-the-ui", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Databricks Asset Bundles (DABs), you want to set two distinct email notifications for when a DAB job is correctly executed and when it fails. You don’t see how to set these notifications in the UI.\nCause\nWhen using a DAB for a job, the UI will not display all properties. 
Setting email notifications is not available in the UI by design. Instead, it is best to set email configurations using YAML syntax.\nSolution\nTo configure email notifications for\njobs.on_success\nand\njobs.on_failure\nwithin Databricks Asset Bundles (DAB), use the UI to switch to code version and use YAML syntax.\nIn the workspace UI, find your existing DAB job.\nIn the top right button\nRun Now\n, click on the down arrow on the right and choose the option\nSwitch to code version (YAML)\n, as shown in the following image.\nAdd your YAML code to set the notification. The following code is a sample of the YAML configuration that includes email notification for both the job run, under jobs and the task, under tasks.\nresources:\r\n jobs:\r\n  Sample_Job_2025_01_16:\r\n   name: Sample Job 2025-01-16\r\n   email_notifications:\r\n    on_success:\r\n     - \r\n    on_failure:\r\n     - \r\n   tasks:\r\n    - task_key: sample_task\r\n     notebook_task:\r\n      notebook_path: /Workspace/Users//\r\n      source: WORKSPACE\r\n      email_notifications:\r\n       on_success:\r\n        - \r\n       on_failure:\r\n        - \r\n   queue:\r\n    enabled: true" +} \ No newline at end of file diff --git a/scraped_kb_articles/cant-uninstall-libraries.json b/scraped_kb_articles/cant-uninstall-libraries.json new file mode 100644 index 0000000000000000000000000000000000000000..22e4675f7c11a7e33e6ec1e59b3a53bf808e2807 --- /dev/null +++ b/scraped_kb_articles/cant-uninstall-libraries.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/cant-uninstall-libraries", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nUsually, libraries can be uninstalled in the\nClusters\nUI. If the checkbox to select the library is disabled, then it’s not possible to uninstall the library from the UI.\nCause\nIf you create a library using REST API version 1.2 and if auto-attach is enabled, the library is installed on all clusters. 
In this scenario, the\nClusters\nUI checkbox to select the library to uninstall is disabled.\nSolution\nCreate a workspace library pointing to the DBFS location of the library that you are unable to uninstall.\nExample: You can’t uninstall a JAR library that is available at this DBFS location:\ndbfs:/Filestore/jars/custom_elastic_spark.jar\nCreate a new workspace library pointing to the same DBFS location.\nIn the library UI, select the checkbox to uninstall the library from individual clusters." +} \ No newline at end of file diff --git a/scraped_kb_articles/casting-bignumeric-values-to-decimal38-38-fails-with-numeric_value_out_of_range-error.json b/scraped_kb_articles/casting-bignumeric-values-to-decimal38-38-fails-with-numeric_value_out_of_range-error.json new file mode 100644 index 0000000000000000000000000000000000000000..15191cbfcc225a671a506933ece5153723b44024 --- /dev/null +++ b/scraped_kb_articles/casting-bignumeric-values-to-decimal38-38-fails-with-numeric_value_out_of_range-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/casting-bignumeric-values-to-decimal38-38-fails-with-numeric_value_out_of_range-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen integrating BigQuery with Databricks through Lake Federation, you try to cast BigNumeric values to Databricks\nDecimal(38, 38)\ntype. You receive the following error.\n[NUMERIC_VALUE_OUT_OF_RANGE.WITHOUT_SUGGESTION] \r\narises during federated queries when a BigQuery BigNumeric value (e.g., 66000.00000000000000000000000000000000000000) cannot be cast to Databricks' Decimal(38, 38) type. 
This occurs because the value exceeds the 38-digit precision limit enforced by Spark.\nCause\nWhen using Lake Federation, BigNumeric values are automatically mapped to\nDecimal(38, 38)\nin Databricks.\nDecimal(38, 38)\nreserves all 38 digits for the fractional part (scale) of the value, leaving no digits for the integer part.\nValues with over 38 digits of precision and non-zero integer digits (for example, 66000.0...) violate the\nDecimal(38,38)\nbehavior, triggering the error.\nSolution\nModify the federated query to first cast the BigNumeric column to\nNUMERIC\n(which has 38-digit precision) in BigQuery, then transfer the data to Databricks.\nSELECT \r\n  CAST( AS NUMERIC) AS  \r\nFROM \r\n  `..`" +} \ No newline at end of file diff --git a/scraped_kb_articles/casting-string-to-datetimestamp-in-dlt-pipeline-does-not-throw-an-error.json b/scraped_kb_articles/casting-string-to-datetimestamp-in-dlt-pipeline-does-not-throw-an-error.json new file mode 100644 index 0000000000000000000000000000000000000000..a2a35915793817b5227019016fe95108ac838b76 --- /dev/null +++ b/scraped_kb_articles/casting-string-to-datetimestamp-in-dlt-pipeline-does-not-throw-an-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/casting-string-to-datetimestamp-in-dlt-pipeline-does-not-throw-an-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Delta Live Tables, you\nshould\nencounter an issue when a string value is cast to a date/timestamp datatype, but do not.\nExample\nA sample DLT pipeline definition:\n@dlt.table(\r\n    name='test',\r\n    temporary=True\r\n)\r\ndef df_src():\r\n    return (\r\n        spark.sql(\"select 101 as id, cast('test' as Date) as dt from \")\r\n    )\nCause\nThe Delta Live Tables pipeline inserts\nNULL\ns in the target table for each string cast to a date/timestamp datatype instead of failing at the analysis phase.\nNote\nThis occurs in Delta Live Tables in both\nCURRENT\nand\nPREVIEW\nchannels 
but not on Databricks Runtime.\nSolution\nConfigure the Delta Live Tables pipeline to enforce ANSI SQL compliance by setting the\nspark.sql.ansi.enabled\nparameter to\ntrue\n.\nspark.conf.set(\"spark.sql.ansi.enabled\", \"true\")\nAs an alternative, you can add this configuration in your pipeline settings by clicking\nAdd Configuration\nunder the\nAdvanced\nsection.\nThe pipeline now fails as it should.\norg.apache.spark.SparkDateTimeException: [CAST_INVALID_INPUT]\nThe value\ntest\nof the type\nSTRING\ncannot be cast to\nDATE\nbecause it is malformed. Correct the value as per the syntax, or change its target type. Use\ntry_cast\nto tolerate malformed input and return\nNULL\ninstead. If necessary, set\nspark.sql.ansi.enabled\nto\nfalse\nto bypass this error.\nselect 101 as id, cast('test' as Date) as dt from " +} \ No newline at end of file diff --git a/scraped_kb_articles/chained-transformations.json b/scraped_kb_articles/chained-transformations.json new file mode 100644 index 0000000000000000000000000000000000000000..b3ddb0f5913977d9f11be2e5b8714ffac7cfd8ca --- /dev/null +++ b/scraped_kb_articles/chained-transformations.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/chained-transformations", + "title": "Título do Artigo Desconhecido", + "content": "Sometimes you may need to perform multiple transformations on your DataFrame:\n%scala\r\nimport org.apache.spark.sql.functions._\r\nimport org.apache.spark.sql.DataFrame\r\n\r\nval testDf = (1 to 10).toDF(\"col\")\r\n\r\ndef func0(x: Int => Int, y: Int)(in: DataFrame): DataFrame = {\r\n  in.filter('col > x(y))\r\n}\r\ndef func1(x: Int)(in: DataFrame): DataFrame = {\r\n  in.selectExpr(\"col\", s\"col + $x as col1\")\r\n}\r\ndef func2(add: Int)(in: DataFrame): DataFrame = {\r\n  in.withColumn(\"col2\", expr(s\"col1 + $add\"))\r\n}\nWhen you apply these transformations, you may end up with spaghetti code like this:\n%scala\r\n\r\ndef inc(i: Int) = i + 1\r\n\r\nval tmp0 = func0(inc, 
3)(testDf)\r\nval tmp1 = func1(1)(tmp0)\r\nval tmp2 = func2(2)(tmp1)\r\nval res = tmp2.withColumn(\"col3\", expr(\"col2 + 3\"))\nThis article describes several methods to simplify chained transformations.\nDataFrame\ntransform\nAPI\nTo benefit from the functional programming style in Spark, you can leverage the DataFrame\ntransform\nAPI, for example:\n%scala\r\n\r\nval res = testDf.transform(func0(inc, 4))\r\n                .transform(func1(1))\r\n                .transform(func2(2))\r\n                .withColumn(\"col3\", expr(\"col2 + 3\"))\nFunction.chain\nAPI\nTo go even further, you can leverage the Scala Function library, to\nchain\nthe transformations, for example:\n%scala\r\n\r\nval chained = Function.chain(List(func0(inc, 4)(_), func1(1)(_), func2(2)(_)))\r\nval res = testDf.transform(chained)\r\n                .withColumn(\"col3\", expr(\"col2 + 3\"))\nimplicit\nclass\nAnother alternative is to define a Scala\nimplicit\nclass, which allows you to eliminate the DataFrame\ntransform\nAPI:\n%scala\r\n\r\nimplicit class MyTransforms(df: DataFrame) {\r\n    def func0(x: Int => Int, y: Int): DataFrame = {\r\n        df.filter('col > x(y))\r\n    }\r\n    def func1(x: Int): DataFrame = {\r\n        df.selectExpr(\"col\", s\"col + $x as col1\")\r\n    }\r\n    def func2(add: Int): DataFrame = {\r\n        df.withColumn(\"col2\", expr(s\"col1 + $add\"))\r\n    }\r\n}\nThen you can call the functions directly:\n%scala\r\n\r\nval res = testDf.func0(inc, 1)\r\n            .func1(2)\r\n            .func2(3)\r\n            .withColumn(\"col3\", expr(\"col2 + 3\"))" +} \ No newline at end of file diff --git a/scraped_kb_articles/change-cluster-config-for-delta-live-table-pipeline.json b/scraped_kb_articles/change-cluster-config-for-delta-live-table-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..ac22b118b4bea067120687ed2d034f2c75f7fdd7 --- /dev/null +++ b/scraped_kb_articles/change-cluster-config-for-delta-live-table-pipeline.json 
@@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/change-cluster-config-for-delta-live-table-pipeline", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using Delta Live Tables and want to change the cluster configuration.\nYou create a pipeline, but only have options to enable or disable Photon and select the number of workers.\nCause\nWhen you create a Delta Live Table pipeline, most parameters are configured with default values. These values cannot be configured before the pipeline is created.\nSolution\nYou can change the cluster configuration after the pipeline is created.\nClick\nWorkflows\nin the sidebar.\nClick the\nDelta Live Tables\ntab.\nClick the name of your pipeline.\nClick\nSettings\n.\nOn the\nEdit Pipeline Settings\npop-up, click\nJSON.\nEdit the JSON to specify your cluster configuration. You can update all of the Delta Live Table settings (\nAWS\n|\nAzure\n|\nGCP\n) in the JSON file.\nClick\nSave\n.\nClick\nStart\nto start your pipeline with the new cluster configuration." +} \ No newline at end of file diff --git a/scraped_kb_articles/change-r-version.json b/scraped_kb_articles/change-r-version.json new file mode 100644 index 0000000000000000000000000000000000000000..77f4ff98ef241e53d548ba268d186d97ff7e5543 --- /dev/null +++ b/scraped_kb_articles/change-r-version.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/change-r-version", + "title": "Título do Artigo Desconhecido", + "content": "These instructions describe how to install a different version of R (r-base) on a cluster. 
You can check the default r-base version that each Databricks Runtime version is installed with in the System environment section of each Databricks Runtime release note (\nAWS\n|\nAzure\n|\nGCP\n).\nList available r-base-core versions\nTo list the versions of r-base-core that can be installed and the version format:\nPaste the following shell command in a notebook cell:\n%sh\r\n\r\nadd-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial//'\r\napt-get -y update\r\napt-cache madison r-base-core\nRun the cell.\nFor example, you can install version 3.3.3 by specifying\n3.3.3-1xenial0\n.\nInstall a specific R version\nPaste the following shell command into a notebook cell. Set\n\nto the R version to be installed. Set\n\nto a file path under\n/dbfs\nwhere this init script will be saved.\n%sh\r\n\r\nR_VERSION=''\r\nINIT_SCRIPT_PATH=''\r\n\r\nmkdir -p $(dirname $INIT_SCRIPT_PATH)\r\n\r\necho \"set -e\r\n\r\n# Add the repository containing another version of R\r\nadd-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial//'\r\napt-get -y update\r\n\r\n# Uninstall current R version\r\napt-get remove -y r-base-core\r\n\r\n# Install another version of R\r\napt-get install -y r-base-core=$R_VERSION\r\n\r\n# Must install Rserve to use Databricks notebook\r\nR -e \\\"install.packages('Rserve', repos='https://rforge.net/', type = 'source')\\\"\r\nR -e \\\"install.packages('hwriterPlus', repos='https://mran.revolutionanalytics.com/snapshot/2017-02-26')\\\"\" > $INIT_SCRIPT_PATH\nRun the notebook cell to save the init script to a file on DBFS.\nConfigure a cluster with a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n). When specifying the init script path in the cluster creation UI, modify the format of the init script path to change\n/dbfs\nto\ndbfs:/\n. 
For example, if\n\nis set to\n/dbfs/examplepath/change-r-base.sh\n, then in the cluster creation UI specify the init script path\ndbfs:/examplepath/change-r-base.sh\n.\nAfter the cluster starts up, verify that the desired R version is installed by running\n%r R.version\nin a notebook cell." +} \ No newline at end of file diff --git a/scraped_kb_articles/change-the-minor-version-of-python-in-a-cluster.json b/scraped_kb_articles/change-the-minor-version-of-python-in-a-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..2b5629e1cdaab3c0eb92bebb111af87a5cdb1a42 --- /dev/null +++ b/scraped_kb_articles/change-the-minor-version-of-python-in-a-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/change-the-minor-version-of-python-in-a-cluster", + "title": "Unknown Article Title", + "content": "Problem\nYou want to change the minor version of Python that is included with the version of Databricks Runtime you have selected.\nInfo\nThis method only allows you to update the\nminor\nversion of Python. You cannot update the\nmajor\nversion.\nFor example, you can update Python 3.11.0 to Python 3.11.11.\nYou cannot update Python 3.11.x to Python 3.12.x.\nCause\nEvery version of Databricks Runtime ships with a specific version of Python. 
You can see the included Python version by reviewing the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) for your selected Databricks Runtime.\nOpen the release notes for your selected Databricks Runtime and review the\nSystem environment\nsection.\nYou may have a specific situation where you want to update the Python version, but keep your selected Databricks Runtime.\nSolution\nYou can use a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n) to install an updated version of Python on your cluster when it starts.\nThis example init script uses the\n“deadsnakes” repository\nto install Python and the\npyenv\nGitHub repo\nto install the corresponding version of\npyenv\n.\nInfo\nYou can use any Python repository (including an internal one) to install the Python binaries. The “deadsnakes” repository used here is one of many available Python sources.\nYou will need to specify the version of Python and the version of\npyenv\nbefore running the init script.\n#!/bin/bash\r\n\r\nadd-apt-repository -y ppa:deadsnakes/ppa\r\napt-get update --allow-releaseinfo-change-origin\r\nDEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::=\"--force-confdef\" -o Dpkg::Options::=\"--force-confold\" install \r\nDEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::=\"--force-confdef\" -o Dpkg::Options::=\"--force-confold\" --fix-broken install\r\n\r\nwget https://github.com/pyenv/pyenv/archive/refs/tags/.tar.gz -O pyenv.tar.gz \\\r\n&& tar -xvf pyenv.tar.gz --strip-components 1 -C /databricks/.pyenv \\\r\n&& rm pyenv.tar.gz\nImportant\nWhen using standard (formerly shared) access mode clusters, you must add\n_PIP_USE_IMPORTLIB_METADATA=false\nto the cluster's\nSpark config\n. 
This is required for library installations to work.\nThis init script also does not change the UDF Python version on standard access mode clusters, as init scripts are not applied on UDF workers.\nExample - Install Python 3.11.11 and\npyenv\n2.5.0\nThis example code builds on the above sample to install the latest version of Python 3.11.x and\npyenv\n2.5.0.\n#!/bin/bash\r\n\r\n# install the latest python 3.11 version\r\nadd-apt-repository -y ppa:deadsnakes/ppa\r\napt-get update --allow-releaseinfo-change-origin\r\nDEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::=\"--force-confdef\" -o Dpkg::Options::=\"--force-confold\" install python3.11\r\nDEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::=\"--force-confdef\" -o Dpkg::Options::=\"--force-confold\" --fix-broken install\r\n\r\n# install pyenv 2.5.0 that supports python 3.11.11\r\nwget https://github.com/pyenv/pyenv/archive/refs/tags/v2.5.0.tar.gz -O pyenv.tar.gz \\\r\n&& tar -xvf pyenv.tar.gz --strip-components 1 -C /databricks/.pyenv \\\r\n&& rm pyenv.tar.gz" +} \ No newline at end of file diff --git a/scraped_kb_articles/changing-ansi-mode-inline-doesnt-work-when-querying-a-view.json b/scraped_kb_articles/changing-ansi-mode-inline-doesnt-work-when-querying-a-view.json new file mode 100644 index 0000000000000000000000000000000000000000..a571d561e0f6cbd39326e751e0731b85c0a261fb --- /dev/null +++ b/scraped_kb_articles/changing-ansi-mode-inline-doesnt-work-when-querying-a-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/changing-ansi-mode-inline-doesnt-work-when-querying-a-view", + "title": "Unknown Article Title", + "content": "Problem\nWhen querying a view, you notice unexpected behavior related to ANSI mode. In particular, when you change the\nspark.sql.ansi.enabled\nconfiguration inline, the change is not reflected in the query execution.\nExample\nIn the following code, you create a table and then a view with ANSI mode enabled initially. 
You\nselect c1, c1 / 0 as c2\n, which is not ANSI compliant. When you try to\nSELECT * FROM\nthe view, it fails as expected with a\nDIVIDE_BY_ZERO\nerror.\nYou then disable ANSI mode inline, and run\nSELECT * FROM\nthe view again. The query still fails, but you expect it to succeed.\n%sql\r\n-- create a table\r\nCREATE OR REPLACE TABLE (c1 int);\r\nINSERT INTO VALUES (1);\r\n\r\n-- create a view with ANSI mode enabled\r\nSET spark.sql.ansi.enabled=true;\r\nCREATE OR REPLACE VIEW AS\r\nSELECT\r\n  c1,\r\n  c1 / 0 as c2 -- ℹ️ this does *not* comply with ANSI standards so querying this view with ANSI enabled should throw an exception. \r\nFROM ;\r\n\r\n--\r\n-- test #1\r\n--\r\n-- expectation: this should fail given that ANSI mode is *enabled*\r\nSELECT * FROM ;\r\n-- result: ❌ throws a `DIVIDE_BY_ZERO` error as expected\r\n\r\n--\r\n-- test #2\r\n--\r\nSET spark.sql.ansi.enabled=false;\r\n-- expectation: this should succeed given that ANSI mode is *disabled*\r\nSELECT * FROM ;\r\n-- result: ❌ throws a `DIVIDE_BY_ZERO` error. this is unexpected\nNote\nYou may observe similar behavior with\nALTER VIEW\n.\nCause\nThe ANSI session level configuration set at the time of view creation persists with the view definition.\nSolution\nRecreate the view with the desired ANSI mode.\nIf you prefer not to recreate the view, you can force ANSI exceptions inline using functions such as\ntry_cast\n,\ntry_add\n, or\ntry_divide\nin the non-ANSI operations. For more information, refer to the\ntry_cast function\n,\ntry_add function\n, and\ntry_divide function\ndocumentation." 
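The NULL-on-zero semantics of the `try_divide` workaround can be illustrated with a minimal pure-Python sketch (an illustration of the behavior only, not the Spark implementation — the row data is hypothetical):

```python
def try_divide(numerator, denominator):
    """Mimic Spark SQL's try_divide: return None (NULL) instead of
    raising on division by zero, so the result no longer depends on
    whether ANSI mode is enabled."""
    if denominator == 0:
        return None
    return numerator / denominator

# The view's expression `c1 / 0` raises under ANSI mode;
# try_divide(c1, 0) yields NULL (None) under either mode.
rows = [{"c1": 1}]
result = [{"c1": r["c1"], "c2": try_divide(r["c1"], 0)} for r in rows]
print(result)  # [{'c1': 1, 'c2': None}]
```

Recreating the view with `try_divide(c1, 0)` instead of `c1 / 0` makes the query succeed regardless of the session's ANSI setting.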
+} \ No newline at end of file diff --git a/scraped_kb_articles/charconversionexception-when-importing-non-udf-data-from-ibm-db2-to-databricks.json b/scraped_kb_articles/charconversionexception-when-importing-non-udf-data-from-ibm-db2-to-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..c52b75a4a185cfe2d722bbf10898d32c337f15f5 --- /dev/null +++ b/scraped_kb_articles/charconversionexception-when-importing-non-udf-data-from-ibm-db2-to-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/charconversionexception-when-importing-non-udf-data-from-ibm-db2-to-databricks", + "title": "Unknown Article Title", + "content": "Problem\nWhen importing views from IBM Db2 using Apache Spark, you encounter the following error in the Spark driver logs or job failure details.\nCaught java.io.CharConversionException ERRORCODE=-4220, SQLSTATE=null\nCause\nThe IBM Db2 JCC (JDBC) driver expects character column data to conform to the database's UTF-8 code page. If any column contains invalid or malformed UTF-8 byte sequences (for example, data with characters beyond the valid Unicode range or incorrectly encoded), the driver throws a\nSqlException\nwrapping a\njava.io.CharConversionException\n.\nExample stack trace\norg.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by:com.ibm.db2.jcc.am.SqlException: [jcc][t4][XX][XX][X.X.X] Caught java.io.CharConversionException. See attached Throwable for details. ERRORCODE=-4220,SQLSTATE=null at com.ibm.db2.jcc.am.fd.a(fd.java:731)\nSolution\nTo handle extended or invalid characters more gracefully, configure your cluster with the following Spark configurations so they apply to all notebooks and jobs on that cluster. 
These settings modify the Db2 JCC driver’s behavior to tolerate character encoding issues without failing the entire query.\nspark.driver.extraJavaOptions -Ddb2.jcc.charsetDecoderEncoder=3\r\nspark.executor.extraJavaOptions -Ddb2.jcc.charsetDecoderEncoder=3\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor additional reference on shared Db2 driver properties, see IBM’s\nJDBC throws java.io.CharConversionException\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/check-spark-property-modifiable.json b/scraped_kb_articles/check-spark-property-modifiable.json new file mode 100644 index 0000000000000000000000000000000000000000..c9d3a92dd1d1282a4bddec666709bb16918d59aa --- /dev/null +++ b/scraped_kb_articles/check-spark-property-modifiable.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/check-spark-property-modifiable", + "title": "Unknown Article Title", + "content": "Problem\nYou can tune applications by setting various configurations. Some configurations must be set at the cluster level, whereas some are set inside notebooks or applications.\nSolution\nTo check if a particular Spark configuration can be set in a notebook, run the following command in a notebook cell:\n%scala\r\n\r\nspark.conf.isModifiable(\"spark.databricks.preemption.enabled\")\nIf\ntrue\nis returned, then the property can be set in the notebook. Otherwise, it must be set at the cluster level." 
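The decision that the `isModifiable` check drives can be sketched in plain Python; the `modifiable` dictionary below stands in for hypothetical results of `spark.conf.isModifiable(...)` calls on a live cluster:

```python
# Hypothetical results of spark.conf.isModifiable(name) for two properties:
# spark.sql.shuffle.partitions is session-modifiable, spark.executor.memory is not.
modifiable = {
    "spark.sql.shuffle.partitions": True,
    "spark.executor.memory": False,
}

def where_to_set(name: str) -> str:
    """Return 'notebook' when a property is session-modifiable,
    otherwise 'cluster' (set it in the cluster's Spark config)."""
    return "notebook" if modifiable.get(name, False) else "cluster"

print(where_to_set("spark.sql.shuffle.partitions"))  # notebook
print(where_to_set("spark.executor.memory"))         # cluster
```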
+} \ No newline at end of file diff --git a/scraped_kb_articles/checkpoint-no-cleanup-display.json b/scraped_kb_articles/checkpoint-no-cleanup-display.json new file mode 100644 index 0000000000000000000000000000000000000000..16302721549f5c421eae94b0508fe4043e540acc --- /dev/null +++ b/scraped_kb_articles/checkpoint-no-cleanup-display.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/checkpoint-no-cleanup-display", + "title": "Unknown Article Title", + "content": "Problem\nYou have a streaming job using\ndisplay()\nto display DataFrames.\n%scala\r\n\r\nval streamingDF = spark.readStream.schema(schema).parquet()\r\ndisplay(streamingDF)\nCheckpoint files are being created, but are not being deleted.\nYou can verify the problem by navigating to the root directory and looking in the\n/local_disk0/tmp/\nfolder. Checkpoint files remain in the folder.\nCause\nThe command\ndisplay(streamingDF)\nis a memory sink implementation that can display the data from the streaming DataFrame for every micro-batch. A checkpoint directory is required to track the streaming updates.\nIf you have not specified a custom checkpoint location, a default checkpoint directory is created at\n/local_disk0/tmp/\n.\nDatabricks uses the checkpoint directory to ensure correct and consistent progress information. When a stream is shut down, either purposely or accidentally, the checkpoint directory allows Databricks to restart and pick up exactly where it left off.\nIf a stream is shut down by cancelling the stream from the notebook, the Databricks job attempts to clean up the checkpoint directory on a best-effort basis. 
If the stream is terminated in any other way, or if the job is terminated, the checkpoint directory is not cleaned up.\nThis is as designed.\nSolution\nYou can prevent unwanted checkpoint files with the following guidelines.\nYou should not use\ndisplay(streamingDF)\nin production jobs.\nIf\ndisplay(streamingDF)\nis mandatory for your use case, you should manually specify the checkpoint directory by using the Apache Spark config option\nspark.sql.streaming.checkpointLocation\n.\nIf you manually specify the checkpoint directory, you should periodically delete any remaining files in this directory. This can be done on a weekly basis." +} \ No newline at end of file diff --git a/scraped_kb_articles/checkpoint-no-cleanup-foreachbatch.json b/scraped_kb_articles/checkpoint-no-cleanup-foreachbatch.json new file mode 100644 index 0000000000000000000000000000000000000000..caf6e690cc0439b193e091a8b5605f334e02a2d3 --- /dev/null +++ b/scraped_kb_articles/checkpoint-no-cleanup-foreachbatch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/checkpoint-no-cleanup-foreachbatch", + "title": "Unknown Article Title", + "content": "Problem\nYou have a streaming job using\nforeachBatch()\nto process DataFrames.\n%scala\r\n\r\nstreamingDF.writeStream.outputMode(\"append\").foreachBatch { (batchDF: DataFrame, batchId: Long) =>\r\n  batchDF.write.format(\"parquet\").mode(\"overwrite\").save(output_directory)\r\n}.start()\nCheckpoint files are being created, but are not being deleted.\nYou can verify the problem by navigating to the root directory and looking in the\n/local_disk0/tmp/\nfolder. Checkpoint files remain in the folder.\nCause\nThe command\nforeachBatch()\nis used to support DataFrame operations that are not normally supported on streaming DataFrames. By using\nforeachBatch()\nyou can apply these operations to every micro-batch. 
This requires a checkpoint directory to track the streaming updates.\nIf you have not specified a custom checkpoint location, a default checkpoint directory is created at\n/local_disk0/tmp/\n.\nDatabricks uses the checkpoint directory to ensure correct and consistent progress information. When a stream is shut down, either purposely or accidentally, the checkpoint directory allows Databricks to restart and pick up exactly where it left off.\nIf a stream is shut down by cancelling the stream from the notebook, the Databricks job attempts to clean up the checkpoint directory on a best-effort basis. If the stream is terminated in any other way, or if the job is terminated, the checkpoint directory is not cleaned up.\nThis is as designed.\nSolution\nYou should manually specify the checkpoint directory with the\ncheckpointLocation\noption.\n%scala\r\n\r\nstreamingDF.writeStream.option(\"checkpointLocation\",\"\").outputMode(\"append\").foreachBatch { (batchDF: DataFrame, batchId: Long) =>\r\nbatchDF.write.format(\"parquet\").mode(\"overwrite\").save(output_directory)\r\n}.start()" +} \ No newline at end of file diff --git a/scraped_kb_articles/circular-reference-error-when-trying-to-apply-row-filter-function-in-databricks-sql-analytics.json b/scraped_kb_articles/circular-reference-error-when-trying-to-apply-row-filter-function-in-databricks-sql-analytics.json new file mode 100644 index 0000000000000000000000000000000000000000..827750ce9cb07460521c3a8ceeefbbf282ffbb8a --- /dev/null +++ b/scraped_kb_articles/circular-reference-error-when-trying-to-apply-row-filter-function-in-databricks-sql-analytics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/circular-reference-error-when-trying-to-apply-row-filter-function-in-databricks-sql-analytics", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to assign a row filter function to a table in Databricks SQL analytics, you receive an error 
message.\n[ErrorClass=INVALID_PARAMETER_VALUE.UC_DEPENDENCY_DEPTH_LIMIT_EXCEEDED] Table '' depth exceeds limit (or has a circular reference).\nCause\nThe table has a row filter which itself contains a reference to the same table (a circular reference). This configuration is not permitted because it would cause an infinite loop.\nSolution\nIdentify and remove the circular reference in the row filter configuration. Ensure that the row filter does not reference the table it is applied to.\nReview and update the row filter logic to avoid any self-referencing configurations.\nTest the updated row filter configuration to ensure that the query executes successfully without encountering the circular reference error.\nFor further guidance, refer to the\nFilter sensitive table data using row filters and column masks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/classnotfoundexception-error-when-executing-a-job-or-notebook-with-a-custom-kryo-serializer.json b/scraped_kb_articles/classnotfoundexception-error-when-executing-a-job-or-notebook-with-a-custom-kryo-serializer.json new file mode 100644 index 0000000000000000000000000000000000000000..7f05c294ac07e0a41c6748fee1f47d573ead8e22 --- /dev/null +++ b/scraped_kb_articles/classnotfoundexception-error-when-executing-a-job-or-notebook-with-a-custom-kryo-serializer.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/classnotfoundexception-error-when-executing-a-job-or-notebook-with-a-custom-kryo-serializer", + "title": "Unknown Article Title", + "content": "Problem\nYou build a JAR file defining a custom Kryo serializer and install the file in your cluster's libraries using the API or the UI. 
Then you add the\nspark.serializer org.apache.spark.serializer.KryoSerializer\nand\nspark.kryo.registrator \nApache Spark properties to your cluster's configuration.\nWhen you then try to execute a job or notebook, it fails with a\nClassNotFoundException\nerror.\nCause\nWhen you install a custom library and define a custom serializer on a cluster, the library is loaded on the driver but not the executor upon starting. However, the library is made available for the executor to install.\nWhen the first task that needs the library is triggered on the executor, the executor tries to pull/install the library from the driver instead. This behavior is specific to when the task is serialized using\n\n, which is part of your library.\nNote\nWhen using Photon, executors allocated at cluster start are able to fetch the library and deserialize the task, but executors allocated dynamically still give the\nClassNotFoundException\nerror.\nSolution\nThere are two options available. The first option is to create an init script using the following steps.\n1. Instead of installing a library on the cluster using the configurations, upload the JAR file with the custom Kryo classes to your workspace file system or volume.\n2. Create the below init script using the JAR file path from the previous step.\n#!/bin/sh\r\ncp /databricks/jars/\n3. Add the init script from the previous step to the cluster configurations under the\nAdvanced options > Init Scripts\ntab.\n4. Make sure the custom Kryo serializer configuration is still in place. In the same\nAdvanced options\nspace, click the\nSpark\ntab and verify your code is in the\nSpark config\nbox.\n5. Restart the cluster.\nAlternatively, use the following property in the Spark configurations on your compute.\nspark.jars \nImportant\nIn this option, the JAR may fail to download if you do not have access to the location, or in cases of compute using legacy credential passthrough." 
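The init script in option one can be generated with a small helper; the `/Volumes/...` JAR path below is a hypothetical placeholder, since the article elides the actual source path of the serializer JAR:

```python
def build_copy_jar_init_script(jar_path: str) -> str:
    """Build a cluster-scoped init script that copies a custom
    serializer JAR into /databricks/jars/ so executors can load it
    at startup instead of fetching it from the driver."""
    return "#!/bin/sh\ncp {} /databricks/jars/\n".format(jar_path)

# Hypothetical volume path for the JAR built with the custom Kryo classes.
script = build_copy_jar_init_script("/Volumes/main/libs/kryo-serializers.jar")
print(script)
```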
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster%E2%80%99s-apache-spark-ui-not-appearing.json b/scraped_kb_articles/cluster%E2%80%99s-apache-spark-ui-not-appearing.json new file mode 100644 index 0000000000000000000000000000000000000000..f181ff79f96dc8659fbf2dcc7f6aa0bc3eb0308b --- /dev/null +++ b/scraped_kb_articles/cluster%E2%80%99s-apache-spark-ui-not-appearing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster%E2%80%99s-apache-spark-ui-not-appearing", + "title": "Unknown Article Title", + "content": "Problem\nYour cluster’s Apache Spark UI is not available, and you see the following error message.\nCould not find data to load UI for driver in cluster \nCause\nThis message can appear when you have a custom Spark config,\nspark.extraListeners\noverwriting the default Databricks daemon listener\ncom.databricks.backend.daemon.driver.DBCEventLoggingListener\n.\nSolution\nOpen the affected cluster.\nClick the\nEdit\nbutton.\nScroll to\nAdvanced options\nand click to expand.\nClick the\nSpark\noption in the vertical menu.\nWithin the\nSpark config\nfield, select\nspark.extraListeners\n.\nAppend the default Databricks daemon listener to your custom listener. You can use the following example code.\n, com.databricks.backend.daemon.driver.DBCEventLoggingListener\nIf this Spark setting does not appear in the Spark config field, check any of the init scripts influencing the cluster and adjust using the init script. You can use the following example code.\n\"spark.extraListeners\" = \", com.databricks.backend.daemon.driver.DBCEventLoggingListener\"\nIf a further init script check doesn’t reveal any script influencing the cluster, contact Support to determine alternative causes." 
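The append step can be sketched as a helper that restores the Databricks daemon listener to a `spark.extraListeners` value; the custom listener name `com.example.MyListener` is hypothetical:

```python
DEFAULT_LISTENER = "com.databricks.backend.daemon.driver.DBCEventLoggingListener"

def with_default_listener(extra_listeners: str) -> str:
    """Append the default Databricks daemon listener to a
    spark.extraListeners value unless it is already present."""
    listeners = [l.strip() for l in extra_listeners.split(",") if l.strip()]
    if DEFAULT_LISTENER not in listeners:
        listeners.append(DEFAULT_LISTENER)
    return ", ".join(listeners)

print(with_default_listener("com.example.MyListener"))
# com.example.MyListener, com.databricks.backend.daemon.driver.DBCEventLoggingListener
```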
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-does-not-support-jobs-workload-error-during-notebook-or-job-run.json b/scraped_kb_articles/cluster-does-not-support-jobs-workload-error-during-notebook-or-job-run.json new file mode 100644 index 0000000000000000000000000000000000000000..af72d3cbc956bfbb30192837db77e58a87adc57c --- /dev/null +++ b/scraped_kb_articles/cluster-does-not-support-jobs-workload-error-during-notebook-or-job-run.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/cluster-does-not-support-jobs-workload-error-during-notebook-or-job-run", + "title": "Unknown Article Title", + "content": "Problem\nWhen running notebooks or jobs, you receive an error message stating the cluster doesn’t support jobs workload.\ncom.databricks.WorkflowException: com.databricks.common.client.DatabricksServiceHttpClientException: INVALID_PARAMETER_VALUE: The cluster does not support jobs workload\nCause\nYou have a cluster policy that prevents the use of the dbutils.notebook.run API, which is used to run notebooks as ephemeral jobs.\nThe relevant policy block is:\n\"workload_type.clients.jobs\": {\r\n\"type\": \"fixed\",\r\n\"value\": false\r\n}\nSolution\nIf possible, remove or modify the block that restricts jobs workload in your existing cluster policy. Edit the cluster policy JSON to remove the following code.\n\"workload_type.clients.jobs\": {\r\n\"type\": \"fixed\",\r\n\"value\": false\r\n}\nIf modifying your existing policy is not feasible, apply a different cluster policy that doesn’t have the restrictive block. This may involve creating a new policy or using a different, existing policy that allows jobs workload.\nAlternatively, run the code directly within a notebook to avoid the API.\nFor more information, refer to the\nCompute policy definition\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
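The policy edit can be sketched as a small helper that strips the restrictive block from a policy definition (a sketch only; real policies usually contain many more rules than this one-key example):

```python
import json

def allow_jobs_workload(policy_json: str) -> str:
    """Drop the fixed workload_type.clients.jobs=false rule from a
    cluster policy definition so clusters under the policy can run
    jobs workloads again."""
    policy = json.loads(policy_json)
    rule = policy.get("workload_type.clients.jobs")
    if rule and rule.get("type") == "fixed" and rule.get("value") is False:
        del policy["workload_type.clients.jobs"]
    return json.dumps(policy, indent=2)

before = '{"workload_type.clients.jobs": {"type": "fixed", "value": false}}'
print(allow_jobs_workload(before))  # {}
```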
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-failed-launch.json b/scraped_kb_articles/cluster-failed-launch.json new file mode 100644 index 0000000000000000000000000000000000000000..adbd243c5cf7862050bab2b3aebc1b71793abab1 --- /dev/null +++ b/scraped_kb_articles/cluster-failed-launch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-failed-launch", + "title": "Unknown Article Title", + "content": "Table of Contents\nCluster timeout\nGlobal or cluster-specific init scripts\nToo many libraries installed in cluster UI\nCloud provider limit\nCloud provider shutdown\nInstances unreachable (Azure)\nThis article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs.\nCluster timeout\nError messages:\nDriver failed to start in time\r\nINTERNAL_ERROR: The Spark driver failed to start within 300 seconds\r\nCluster failed to be healthy within 200 seconds\nCause\nThe cluster can fail to launch if it has a connection to an external Hive metastore and it tries to download all the Hive metastore libraries from a Maven repo. A cluster downloads almost 200 JAR files, including dependencies. If the Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, then cluster launch fails. This can occur because JAR downloading is taking too much time.\nSolution\nStore the Hive libraries in DBFS and access them locally from the DBFS location. See\nSpark Options\n.\nGlobal or cluster-specific init scripts\nError message:\nThe cluster could not be started in 50 minutes. Cause: Timed out with exception after attempts\nCause\nInit scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker machine to run the scripts locally. All RPCs must return their status before the process continues. 
If any RPC hits an issue and doesn’t respond back (due to a transient networking issue, for example), then the 1-hour timeout can be hit, causing the cluster setup job to fail.\nSolution\nUse a\ncluster-scoped init script\ninstead of global or cluster-named init scripts. With cluster-scoped init scripts, Databricks does not use synchronous blocking of RPCs to fetch init script execution status.\nToo many libraries installed in cluster UI\nError message:\nLibrary installation timed out after 1800 seconds. Libraries that are not yet installed:\nCause\nThis is usually an intermittent problem due to network problems.\nSolution\nUsually you can fix this problem by re-running the job or restarting the cluster.\nThe library installer is configured to time out after 3 minutes. While fetching and installing jars, a timeout can occur due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS location and install them from there.\nCloud provider limit\nError message:\nCluster terminated. Reason: Cloud Provider Limit\nCause\nThis error is usually returned by the cloud provider.\nSolution\nSee the cloud provider error information in\ncluster unexpected termination\n.\nCloud provider shutdown\nError message:\nCluster terminated. Reason: Cloud Provider Shutdown\nCause\nThis error is usually returned by the cloud provider.\nSolution\nSee the cloud provider error information in\ncluster unexpected termination\n.\nInstances unreachable (Azure)\nError message:\nCluster terminated. Reason: Instances Unreachable\r\nAn unexpected error was encountered while setting up the cluster. Please retry and contact Azure Databricks if the problem persists. Internal error message: Timeout while placing node\nCause\nThis error is usually returned by the cloud provider. 
Typically, it occurs when you have an Azure Databricks workspace\ndeployed to your own virtual network (VNet)\n(as opposed to the default VNet created when you launch a new Azure Databricks workspace). If the virtual network where the workspace is deployed is already peered or has an ExpressRoute connection to on-premises resources, the virtual network cannot make an ssh connection to the cluster node when Azure Databricks is attempting to create a cluster.\nSolution\nAdd a user-defined route (UDR) to give the Azure Databricks control plane ssh access to the cluster instances, Blob Storage instances, and artifact resources. This custom UDR allows outbound connections and does not interfere with cluster creation. For detailed UDR instructions, see\nStep 3: Create user-defined routes and associate them with your Azure Databricks virtual network subnets\n. For more VNet-related troubleshooting information, see\nTroubleshooting\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-fails-to-initialize-after-a-databricks-runtime-upgrade.json b/scraped_kb_articles/cluster-fails-to-initialize-after-a-databricks-runtime-upgrade.json new file mode 100644 index 0000000000000000000000000000000000000000..62aa139c690eb71d5107f0c7d975f681cd415d5a --- /dev/null +++ b/scraped_kb_articles/cluster-fails-to-initialize-after-a-databricks-runtime-upgrade.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-fails-to-initialize-after-a-databricks-runtime-upgrade", + "title": "Unknown Article Title", + "content": "Problem\nAfter running a Databricks Runtime upgrade on a cluster, you receive an initialization failure with the following error message.\nSpark error: Spark encountered an error on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists. 
Internal error message: Spark error: Driver down cause: driver state change (exit code: 134)\nCause\nThe Databricks Runtime upgrade may have different Apache Spark configurations than the version you were using before. For example,\nspark.shuffle.spill\nis deprecated in Databricks Runtime 16.1.\nThere may also be inconsistencies with the init scripts set within the updated Databricks Runtime, such as corrupted files, unsupported library dependencies, or inaccessible files.\nSolution\nStart by checking the init script.\nRemove the init script from your cluster configuration.\nRestart the cluster.\nIf the cluster starts without the init script, there is a problem with the init script to investigate further.\nFirst determine if the file is reachable. When accessing the file using a workspace in a volume or Databricks File System (DBFS), make sure the filepath still exists and the permissions are properly set.\nIf the file is reachable, review the init script content to check for dependency conflicts in the libraries included in the script. 
You can either attempt the installation by running the same code in a notebook with the same cluster configuration, or look at your driver logs for error messages indicating library issues.\nIf the init script worked previously with an older Databricks Runtime version, test if moving to this previous Databricks Runtime version works as expected.\nIf the init script works, there may be configurations or dependencies in the script that are not applicable to the current Databricks Runtime.\nIf the cluster does not start after removing the init script, the issue is related to Spark configuration.\nTo check your Spark configuration:\nReview the driver logs for errors in\nlog4j\n.\nSparkExecuteStatementOperation\nin particular typically indicates which Spark configuration module is failing.\nRemove the failing Spark configuration.\nRestart the cluster.\nYou can also generally remove your Spark configurations and add them back individually to test.\nFor more information about configurations, dependencies, and changes in a given Databricks Runtime, review the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
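The "remove configurations and add them back individually" step can be sped up by bisection. This sketch assumes exactly one bad config and uses a callback in place of an actual cluster restart; the config strings are examples (the deprecated `spark.shuffle.spill` is named in the article):

```python
def bisect_configs(configs, fails):
    """Isolate a single bad entry from a list of Spark configs by
    repeatedly testing half of them. `fails(subset)` stands in for
    'does the cluster fail to start with this subset applied?'"""
    candidates = list(configs)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the startup failure.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0] if candidates and fails(candidates) else None

configs = ["spark.shuffle.spill true", "spark.sql.shuffle.partitions 64"]
bad = bisect_configs(configs, lambda subset: "spark.shuffle.spill true" in subset)
print(bad)  # spark.shuffle.spill true
```

Each `fails` call corresponds to one cluster restart, so the bad config is found in O(log n) restarts instead of n.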
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-fails-to-launch-with-a-bootstrap-timeout-error.json b/scraped_kb_articles/cluster-fails-to-launch-with-a-bootstrap-timeout-error.json new file mode 100644 index 0000000000000000000000000000000000000000..fa5c31635f7a37d5096b54cb54538c28be27ca0f --- /dev/null +++ b/scraped_kb_articles/cluster-fails-to-launch-with-a-bootstrap-timeout-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-fails-to-launch-with-a-bootstrap-timeout-error", + "title": "Unknown Article Title", + "content": "Problem\nClusters in your workspace are failing to launch with a\nBootstrap Timeout\nerror message.\nCause\nThis issue can occur due to any one of the following reasons:\nFirewall restrictions\nFirewall throttling\nIncorrect virtual network configuration\nSolution\nCheck the\nDatabricks service status page\n(\nAWS\n|\nAzure\n|\nGCP\n) for any known issues in your region.\nVerify that all required Databricks FQDNs and IPs are allowlisted in your VPC. Make sure to allowlist:\nControl plane IPs\nMetastore\nArtifact blob storage primary\nArtifact blob storage secondary\nSystem tables storage\nLog blob storage\nThe event hubs endpoint from the documentation.\nFor more details, review the\nIP addresses and domains for Databricks services and assets\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nVerify that you have met all the requirements for customer-managed VPC/VNet for all of the subnets you're using with Databricks. 
For more details, review the\nConfigure a customer-managed VPC/Azure virtual network (VNet injection)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCheck with your internal infrastructure team to determine if there are any throttling issues observed in the firewall.\nIn AWS, if your workspace has a private link enabled, review the details in the\nEnable private connectivity using AWS PrivateLink\ndocumentation to verify that the DNS is enabled and it is in the approved state." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-fails-to-launch-with-error-user-specified-an-invalid-argument.json b/scraped_kb_articles/cluster-fails-to-launch-with-error-user-specified-an-invalid-argument.json new file mode 100644 index 0000000000000000000000000000000000000000..36941c9dd080755e2937f3cf79e18e0c528e5e20 --- /dev/null +++ b/scraped_kb_articles/cluster-fails-to-launch-with-error-user-specified-an-invalid-argument.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-fails-to-launch-with-error-user-specified-an-invalid-argument", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour cluster fails to start with an invalid argument message,\n“Cannot launch the cluster because the user specified an invalid argument.”\nThe notice is accompanied by an internal error message.\nINVALID_PARAMETER_VALUE: Assigned user on this cluster does not exist in the workspace anymore.\nCause\nThe cluster owner is no longer active in the workspace.\nSolution\nChange the owner of the cluster to an active user.\nFirst, find the cluster creator from the\ncreator_user_name\nparameter.  
To use the API, refer to the\nGet cluster info\nAPI documentation.\nAlternatively, using the UI:\nIn your workspace, navigate to\nCompute\n.\nFind and click your cluster to open it.\nIn the upper right corner, click the kebab menu.\nChoose\nView JSON\n.\nIn the JSON file, locate the\ncreator_user_name\nvalue.\nThen, confirm that the user is not present in the workspace.\nLast, using the API, change the cluster owner to a user in the workspace. For details, refer to the\nChange cluster owner\nAPI documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-fails-to-start-with-invalidgroupnotfound-error.json b/scraped_kb_articles/cluster-fails-to-start-with-invalidgroupnotfound-error.json new file mode 100644 index 0000000000000000000000000000000000000000..9acc975870a0de93763f91c0b92ffebb8ac9a7f3 --- /dev/null +++ b/scraped_kb_articles/cluster-fails-to-start-with-invalidgroupnotfound-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/cluster-fails-to-start-with-invalidgroupnotfound-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem:\nYour cluster fails to start with an error:\nInvalidGroup.NotFound. The security group 'sg-XYZ' does not exist in VPC 'vpc-XYZ'\nCause:\nThe network security group policy is not correctly configured.\nSolution:\nContact your network engineering team to verify the security group policy is correctly associated with the Databricks workspace VPC.\nEnsure the security group\nsg-XYZ\nexists in the VPC\nvpc-XYZ\n.\nIf the security group does not exist, create it using the appropriate console or API commands.\nIf the security group exists, ensure it is correctly associated with the workspace VPC.\nIf the security group is associated correctly, verify the inbound and outbound rules are configured to allow network traffic for the Databricks cluster.\nAfter updating the security group configuration, restart the cluster and verify that it launches correctly." 
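The final owner-change step goes through the Clusters API. The sketch below only assembles the request pieces; the endpoint path (`POST /api/2.1/clusters/change-owner`) should be confirmed against the Change cluster owner API documentation for your cloud, and the workspace URL, cluster ID, user, and token values are placeholders.

```python
def build_change_owner_request(host, cluster_id, new_owner, token):
    """Assemble the pieces of a Change cluster owner API call.

    The endpoint path is per the Clusters API docs at the time of
    writing -- confirm it for your workspace before use.
    """
    return {
        "url": f"{host}/api/2.1/clusters/change-owner",
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {"cluster_id": cluster_id, "owner_username": new_owner},
    }

req = build_change_owner_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "0123-456789-abcdefgh",                  # hypothetical cluster ID
    "active.user@example.com",               # must exist in the workspace
    "dapiXXXX",                              # personal access token
)
# To send it (not executed here):
# import requests
# requests.post(req["url"], headers=req["headers"], json=req["json"]).raise_for_status()
```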
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-fails-with-fatal-uncaught-exception-error-failed-to-bind.json b/scraped_kb_articles/cluster-fails-with-fatal-uncaught-exception-error-failed-to-bind.json new file mode 100644 index 0000000000000000000000000000000000000000..6ff4b2d4d3932ef877506ee6b5e4e57240e5104d --- /dev/null +++ b/scraped_kb_articles/cluster-fails-with-fatal-uncaught-exception-error-failed-to-bind.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-fails-with-fatal-uncaught-exception-error-failed-to-bind", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nClusters running Databricks Runtime 11.3 LTS or above terminate with a\nFailed to bind\nerror message.\nFatal uncaught exception. Terminating driver.\r\njava.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:6062\nCause\nThis can happen if multiple processes attempt to use the same port. Databricks Runtime 11.3 LTS and above use the\nIPython kernel\n(\nAWS\n|\nAzure\n|\nGCP\n) as the default REPL on port 6062.\nIf you have other software configured to run on the same port it can result in a conflict (for example, Datadog is usually configured on port 6062). If a conflict occurs, the driver node may fail to start.\nSolution\nAs a workaround, you can configure the cluster to use the standard Python shell as the default REPL in the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nspark.databricks.python.defaultPythonRepl pythonshell\nThis prevents the cluster from using the IPython kernel. As a result, there is no port conflict and the driver node successfully starts." 
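Before switching the default REPL, you can confirm that something really is listening on port 6062 by running a quick check on the driver. This diagnostic sketch uses only the Python standard library; 6062 is the IPython kernel port cited above.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

# On an affected driver this reports whether port 6062 is already
# taken, for example by a Datadog agent configured on the same port.
print("6062 in use:", port_in_use(6062))
```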
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-init-script-fails-with-mirror-sync-in-progress-error.json b/scraped_kb_articles/cluster-init-script-fails-with-mirror-sync-in-progress-error.json new file mode 100644 index 0000000000000000000000000000000000000000..eae11dfaf22a39fd2ee0f7bb9cf41e96cafd5f2a --- /dev/null +++ b/scraped_kb_articles/cluster-init-script-fails-with-mirror-sync-in-progress-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-init-script-fails-with-mirror-sync-in-progress-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using a custom init script running at cluster start to install a custom library. It works most of the time, but you encounter intermittent failures when\napt-get update\nruns in the init script.\nThe failures return a\nMirror sync in process\nerror message.\nFailed to fetch https://repos..com/zulu/deb/dists/stable/main/binary-amd64/by-hash/SHA512/  File has unexpected size (228870 != 201863). Mirror sync in progress? File has unexpected size (228870 != 201863). Mirror sync in progress? [IP: 123.45.67.89 443]\nCause\nThis can happen if you are trying to download from a mirror that is not in sync with the main repository. Official repositories usually resolve the issue within 30 minutes, however in rare cases it can take much longer.\nSolution\nWait for the mirror to finishing synchronizing with the repository before attempting to start your cluster.\nAlternatively, if the library causing the failure is no longer needed, you can edit your init script to remove the reference to the problematic repo. You should only take this step if you are positive the library is not used directly or as a dependency.\nDelete\nWarning\nIf the repo you are using has multiple mirrors, you can edit\n/etc/apt/sources.list\nor\n/etc/apt/sources.list.d\nto remove the problematic mirror and point to another mirror instead. 
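A further option for the mirror-sync failure, if waiting is impractical and you prefer not to change mirrors, is to make the flaky step retry. The retry idea is sketched below in Python (the same pattern can be written as a bash loop inside the init script itself); the `apt-get update` command is the one from the article, and the attempt/delay values are illustrative.

```python
import subprocess
import time

def run_with_retries(cmd, attempts=5, delay_s=30):
    """Run cmd, retrying on a non-zero exit (e.g. a mirror mid-sync).

    Returns the successful CompletedProcess, or raises
    CalledProcessError after the final failed attempt.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            time.sleep(delay_s)  # give the mirror time to finish syncing
    result.check_returncode()

# In an init script this would wrap the flaky step, for example:
# run_with_retries(["apt-get", "update"], attempts=5, delay_s=60)
```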
This allows the init script to complete.\nThis is not recommended as a long term solution." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-manager-limit.json b/scraped_kb_articles/cluster-manager-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..210fa70e4254254aea83f756dc293fee33c158b8 --- /dev/null +++ b/scraped_kb_articles/cluster-manager-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-manager-limit", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA Databricks Notebook or Job API returns the following error:\nUnexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was rejected due to API rate limit. Please retry your request later, or choose a larger node type instead.\nCause\nThe error indicates the Cluster Manager Service core instance\nrequest limit\nwas exceeded.\nA Cluster Manager core instance can support a maximum of 1000 requests.\nSolution\nContact Databricks Support to increase the limit set in the core instance.\nDatabricks can increase the job limit\nmaxBurstyUpsizePerOrg\nup to 2000, and\nupsizeTokenRefillRatePerMin\nup to 120. Current running jobs are affected when the limit is increased.\nIncreasing these values can stop the throttling issue, but can also cause high CPU utilization.\nThe best solution for this issue is to replace the Cluster Manager core instance with a larger instance that can support maximum data transmission rates.\nDatabricks Support can change the current Cluster Manager instance type to a larger one." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-named-init-script-migration-notebook.json b/scraped_kb_articles/cluster-named-init-script-migration-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..72a7bfeb6fe59163e110b932533e6777f464d38d --- /dev/null +++ b/scraped_kb_articles/cluster-named-init-script-migration-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-named-init-script-migration-notebook", + "title": "Título do Artigo Desconhecido", + "content": "On Dec 1, 2023, Databricks will disable cluster-named init scripts for all workspaces. This type of init script was previously deprecated and will not be usable after Dec 1, 2023. Cluster-named init scripts were replaced by\ncluster-scoped init scripts in August 2018\n. Cluster-scoped init scripts stored as workspace files continue to be supported.\nDatabricks recommends that you migrate your cluster-named init scripts to cluster-scoped init scripts stored as workspace files as soon as possible.\nYou can manually migrate cluster-named init scripts to\ncluster-scoped init scripts\n(\nAWS\n|\nAzure\n) by removing them from the reserved DBFS path at\n/databricks/init/\nand storing them as workspace files (\nAWS\n|\nAzure\n). Once stored as workspace files, you can configure the init scripts as cluster-scoped init scripts. 
After migrating the init scripts, you should\nDisable legacy cluster-named init scripts for the workspace\n(\nAWS\n|\nAzure\n).\nAlternatively, Databricks Engineering has created a notebook to help automate the migration process.\nThis notebook does the following:\nCluster-named init scripts in the workspace are migrated to cluster-scoped init scripts stored as workspace files.\nCluster-named init scripts are disabled in the workspace.\nCluster-scoped init scripts stored on DBFS in the workspace are migrated to cluster-scoped init scripts stored as workspace files.\nDelete\nInfo\nCluster-named init scripts were never available on GCP workspaces. Cluster-scoped init scripts on DBFS were used on GCP workspaces and should be migrated to cluster-scoped init scripts stored as workspace files. You can run this notebook on a GCP workspace to migrate your existing cluster-scoped init scripts from DBFS to workspace files.\nInstructions\nDelete\nWarning\nYou must be a Databricks admin to run this migration notebook.\nPrerequisites\nYou must run this migration notebook on a cluster using Databricks Runtime 11.3 LTS or above.\nYou should use a bare cluster (no attached init scripts) to run this migration notebook, as the migration process may force a restart of all modified clusters.\nBefore running the migration notebook, you need to have the scope name and secret name for your Personal Access Token.\nFor more information, please review the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n|\nGCP\n) and the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nDelete\nInfo\nIf the cluster owner no longer exists in the Databricks workspace, the cluster might fail to restart after the init-script migration. 
In this situation, you will see a\nPERMISSION_DENIED\nerror message.\nError: PERMISSION_DENIED: UserId: 183738271817178 does not exist in the workspace\r\nThe user 183738271817178 does not exist in the workspace anymore\nTo prevent this from happening, you should ensure the cluster owner is a current user in the workspace. You can update the cluster owner by using the\nChange cluster owner API\n(\nAWS\n|\nAzure\n|\nGCP\n).\nDo a dry run\nExecuting a dry run allows you to test the migration notebook in your workspace without making any changes.\nDownload the\nMigrate cluster-named and cluster-scoped init scripts notebook\n.\nImport the notebook to your workspace.\nAttach the notebook to a cluster.\nRun the notebook.\nA UI screen appears after you run the notebook, along with a warning that the last command failed. This is normal.\nEnsure\nDry Run\nis set to\nTrue\nand\nNew Location\nis set to\nWorkspace Files\n.\nEnter the\nScope Name\nand\nSecret Name\ninto the appropriate fields.\nRun the notebook.\nThe results of the dry run appear in the output at the bottom of the notebook.\nMigrate your init scripts\nRun the Migrate cluster-named and cluster-scoped init scripts notebook.\nA UI screen appears after you run the notebook, along with a warning that the last command failed. This is normal.\nIn the\nNew Location\ndrop down menu, select\nWorkspace Files\n.\nEnter the\nScope Name\nand\nSecret Name\ninto the appropriate fields.\nStart the migration by selecting\nFalse\nin the\nDry Run\ndrop down menu.\nThe notebook automatically re-runs when the value in\nDry Run\nis changed.\nOnce the notebook finishes running, all of your cluster-named init scripts are migrated to cluster-scoped init scripts stored as workspace files. 
All of your cluster-scoped init scripts stored on DBFS are migrated to cluster-scoped init scripts stored as workspace files.\nValidate the migrated init scripts\nThe migrated init scripts are moved to\nworkspace:/init-scripts//\n.\nCluster-named init scripts\nCluster-named init scripts are configured as cluster-scoped init scripts in the corresponding cluster configuration.\nCluster-named init scripts are disabled across the workspace. They should no longer be used.\nCluster-scoped init scripts\nCluster-scoped init scripts on DBFS are now stored as workspace files. The corresponding cluster configurations are automatically updated.\nPermissions\nBecause workspace files have ACLs, the migrated cluster-scoped init scripts are now owned by the admin who ran the migration notebook.\nYou must ensure that permissions are correctly set on the migrated cluster-scoped init scripts if you want other users to be able to run and/or edit the init scripts." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-restart-fails.json b/scraped_kb_articles/cluster-restart-fails.json new file mode 100644 index 0000000000000000000000000000000000000000..78b7eba97a264df359f0624c1c939bbcec8e1a18 --- /dev/null +++ b/scraped_kb_articles/cluster-restart-fails.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-restart-fails", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you attempt to submit a command or query to a cluster, it fails with the following error.\nError: `HTTP Response code: 403, Error message: PERMISSION_DENIED: You do not have permission to autostart `.\nCause\nYou do not have permission to restart or start the cluster.\nThis error can occur regardless of whether the query submission comes from a notebook, job, the JDBC or ODBC driver, and so on.\nSolution\nAsk your workspace administrator to update cluster access rights. 
They should assign at a minimum the CAN_RESTART permission on the target cluster.\nAlternatively, they can assign CAN_MANAGE. These permissions allow performance of key cluster management actions, including starting, restarting, and terminating clusters. To determine who needs to have permissions assigned for the cluster, refer to the following table.\nIf the error comes from a …\nthe entity who needs permission is…\nNotebook\nyou.\nQuery triggered through JDBC/ODBC\nthe user identity used to authenticate to Databricks.\nJob\nthe user mentioned in the job’s\nrun_as\nparameter.\nFor more information, your workspace admin can refer to the “Configure compute permissions” section of the\nManage compute\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nYou can instead ask someone with CAN_RESTART or CAN_MANAGE permissions to start the cluster, and then try to run your command again. For this option, ensure you have the CAN_ATTACH_TO privilege." +} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-scoped-init-script-to-unity-catalog-volume-migration-notebook.json b/scraped_kb_articles/cluster-scoped-init-script-to-unity-catalog-volume-migration-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..aacbd8a981a209bb69165ef30f43963bf1c2315c --- /dev/null +++ b/scraped_kb_articles/cluster-scoped-init-script-to-unity-catalog-volume-migration-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/cluster-scoped-init-script-to-unity-catalog-volume-migration-notebook", + "title": "Título do Artigo Desconhecido", + "content": "On Dec 1, 2023, Databricks will discontinue support of Cluster-scoped init scripts stored as Databricks File System (DBFS) files. 
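For the admin side of the fix, cluster permissions are managed through the Permissions API. The sketch below builds (but does not send) a `PATCH /api/2.0/permissions/clusters/{cluster_id}` request granting CAN_RESTART; confirm the endpoint shape against the Manage compute documentation for your cloud, and note that the host, cluster ID, user, and token values are placeholders.

```python
def build_grant_restart_request(host, cluster_id, user_name, token):
    """Build a Permissions API PATCH granting CAN_RESTART on a cluster.

    PATCH adds or updates the listed entries without replacing the
    cluster's other access control entries.
    """
    return {
        "method": "PATCH",
        "url": f"{host}/api/2.0/permissions/clusters/{cluster_id}",
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {
            "access_control_list": [
                {"user_name": user_name, "permission_level": "CAN_RESTART"}
            ]
        },
    }

req = build_grant_restart_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "0123-456789-abcdefgh",                  # hypothetical cluster ID
    "blocked.user@example.com",              # the entity from the table above
    "dapiXXXX",                              # admin personal access token
)
```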
Unity Catalog customers should migrate cluster-scoped init scripts stored as DBFS files into Unity Catalog volumes as soon as possible.\nDatabricks Engineering has created a notebook to help automate the migration process, it migrates cluster-scoped init scripts stored on DBFS to cluster-scoped init scripts stored as Unity Catalog volume files.\nInstructions\nDelete\nWarning\nYou must be a Databricks workspace admin in order to get and modify all the clusters and jobs in a workspace. If you are not a workspace admin, you will not be able to run this notebook.\nPrerequisites\nYou must own at least one Unity Catalog\nvolume\nor have\nCREATE_VOLUME\nprivilege on at least one Unity Catalog schema to migrate DBFS files into Unity Catalog Volumes.\nYou must be the metastore admin or have\nMANAGE_ALLOWLIST\npermission on the metastore if any init script is installed on Unity Catalog shared mode clusters as those scripts need to be in the\nmetastore artifact allowlist\n(\nAWS\n|\nAzure\n). Note this only applies to private preview customers who have cluster scoped init scripts stored in DBFS on shared mode clusters.\nYou must run this migration notebook on a cluster using Databricks Runtime 13.3 LTS or above in order to copy files from DBFS into Unity Catalog volumes.\nYou should use a bare cluster (no attached init scripts) to run this migration notebook, as the migration process may force a restart of all modified clusters.\nBefore running the migration notebook, you need to have the scope name and secret name for your Personal Access Token. For more information, please review the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n) and the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n) documentation.\nDelete\nInfo\nIf the cluster owner no longer exists in the Databricks workspace, the cluster might fail to restart after the init-script migration. 
In this situation, you will see a\nPERMISSION_DENIED\nerror message.\nError: PERMISSION_DENIED: UserId: 183738271817178 does not exist in the workspace\r\nThe user 183738271817178 does not exist in the workspace anymore\nTo prevent this from happening, you should ensure the cluster owner is a current user in the workspace. You can update the cluster owner by using the\nChange cluster owner API\n(\nAWS\n|\nAzure\n).\nDo a dry run\nExecuting a dry run allows you to test the migration notebook in your workspace. The dry run mode will copy all the cluster scoped init scripts stored on DBFS into the UC volume you specified, but will not actually replace any init scripts on clusters and jobs with the UC volume init scripts.\nDownload the\nMigrate cluster-scoped init scripts from DBFS to Unity Catalog volumes notebook\n.\nImport the notebook to your workspace.\nAttach the notebook to a cluster running Databricks Runtime 13.3 LTS or above.\nRun the notebook.\nA UI screen appears after you run the notebook, along with a warning that the last command failed. This is normal.\nEnsure Dry Run is set to True.\nEnter the Scope Name and Secret Name into the appropriate fields.\nEnter the volume name which you want to migrate DBFS init scripts into. If the volume does not exist, the script will attempt to create it.\nRun the notebook.\nThe results of the dry run appear in the output at the bottom of the notebook. 
It will tell you what edits it would attempt to make with the clusters and jobs and what init scripts it would attempt to replace with.\nMigrate your init scripts\nStart the migration by selecting False in the Dry Run drop down menu.\nThe notebook automatically re-runs when the value in Dry Run is changed.\nOnce the notebook finishes running, all of your cluster-scoped init scripts stored on DBFS are migrated to cluster-scoped init scripts stored as Unity Catalog volumes, and the script performs its best effort at replacing existing DBFS init scripts attached to clusters and jobs with init scripts stored in Unity Catalog volumes. If any cluster or jobs failed to be migrated, the errors will be displayed in the notebook results.\nValidate the migrated init scripts\nThe migrated init scripts are moved to the volume you specified, you can check their existence by navigating to the “Data (Catalog)” tab on the left side navigation panel of the UI and then navigate to the volume you chose for the migration. You should be able to see all the init scripts migrated into the volume with their original file structure.\nYou should also be able to see the clusters and jobs now having the migrated init scripts pointing to the new Volume path.\nTroubleshooting\nThere are a few possibilities that the migration of init scripts on clusters or jobs can fail. Here’s a list of the potential error messages and how to resolve them.\nError message\nResolution\nFailed to list all clusters or jobs in workspace\nThe current user is not a workspace admin so the script cannot fetch the cluster and job details in order to migrate the init scripts, make sure that you are running as workspace admin for the migration.\nFailed to create Unity Catalog volume for init script migration\nThe script failed to create the Unity Catalog volume necessary for migrating DBFS init scripts. 
Give yourself\nCREATE_VOLUME\npermission on the parent schema of the volume you selected, or just select an existing volume which you are an owner of.\nFailed to update Unity Catalog volume permission with\nREAD_VOLUME\nfor account users.\nThe script fails to update the volume so that all cluster owners can read the init scripts within it. This is most likely because the current user is not an owner of the volume selected. Please select a volume that the user is an owner of.\nFailed to allowlist migration UC volume in the metastore artifact allowlist. If you are not a metastore admin, please ask them to grant you\nMANAGE_ALLOWLIST\npermission on the metastore, or help add the path into the metastore artifact allowlist.\nYou have at least one DBFS init script attached on the Unity Catalog shared mode cluster that needs to be migrated. However, you are not metastore admin or do not have the\nMANAGE_ALLOWLIST\npermission on the metastore, so you cannot add the new volume paths into the metastore artifact allowlist for it to be usable on Unity Catalog shared mode clusters. Please either ask the metastore admin to add the volume path into the allowlist following instructions in\nmetastore artifact allowlist\n(\nAWS\n|\nAzure\n), or ask for\nMANAGE_ALLOWLIST\npermission and do it yourself.\nCluster with access mode is not eligible for Unity Catalog volumes.\nOnly clusters with access mode “\nASSIGNED (SINGLE USER)\n” or “\nSHARED (USER_ISOLATION)\n” can use Unity Catalog volumes as init script sources. 
If you are using DBFS init scripts on non UC clusters, please follow this\narticle\nto migrate the init scripts into workspace files.\nSpark version of cluster is not eligible for Unity Catalog volume based init script (requires >= 13.3 but actual version is _).\nOR\nCluster is using a custom Databricks runtime image which is not eligible for Unity Catalog volume based init script.\nYour cluster or job is using Databricks Runtime 13.2 and below or a custom image provided by Databricks engineers which do not support Unity Catalog Volumes based init scripts. For\nSHARED\nmode clusters, the only resolution is to upgrade to Databricks Runtime 13.3 and above. For\nASSIGNED\nmode clusters, you can also consider following this\narticle\nto migrate the init scripts into workspace files if a Databricks Runtime upgrade is not feasible.\nFailed to migrate cluster from DBFS init script into UC volume init scripts. Error message: _.\nThe script failed to modify the cluster with updated UC volume based init scripts, and the reason is not one of those that we expected. This could be caused by issues like the cluster being created by a managed service like jobs or DLT, or if the current user lacks “\nCAN MANAGE\n” permission to the cluster. The error message should give more details on how to resolve this issue.\nFailed to migrate job from DBFS init script into UC volume init scripts. Error message: _.\nThe script failed to modify the job with updated UC volume based init scripts, and the reason is not one of those that we expected. This could be caused by issues like the current user lacks “\nCAN MANAGE\n” permission to the job. The error message should give more details on how to resolve this issue." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-spark-config-not-applied.json b/scraped_kb_articles/cluster-spark-config-not-applied.json new file mode 100644 index 0000000000000000000000000000000000000000..c776d25d1c4c1110d7112358632352395468f26d --- /dev/null +++ b/scraped_kb_articles/cluster-spark-config-not-applied.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-spark-config-not-applied", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour cluster’s\nSpark configuration\nvalues are not applied.\nCause\nThis happens when the\nSpark config\nvalues are declared in the cluster configuration as well as in an init script.\nWhen Spark config values are located in more than one place, the configuration in the init script takes precedence and the cluster ignores the configuration settings in the UI.\nSolution\nYou should define your Spark configuration values in one place.\nChoose to define the Spark configuration in the cluster configuration or include the Spark configuration in an init script.\nDo not do both." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-startup-failure-while-running-proxy-configured-init-script-with-other-init-scripts.json b/scraped_kb_articles/cluster-startup-failure-while-running-proxy-configured-init-script-with-other-init-scripts.json new file mode 100644 index 0000000000000000000000000000000000000000..8826f638ea5c4fa14dca19ddc83f27b45d3c9596 --- /dev/null +++ b/scraped_kb_articles/cluster-startup-failure-while-running-proxy-configured-init-script-with-other-init-scripts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-startup-failure-while-running-proxy-configured-init-script-with-other-init-scripts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour cluster fails to start when you execute a proxy-configured init script and another init script (for example, an init script for installing libraries using\nrequirements.txt\n) simultaneously.\nCause\nThe proxy-configured init script applies proxy settings to all outgoing network requests, including requests to local services necessary for accessing the workspace file system.\nWhen you also execute the\npip install -r requirements.txt\ncommand within the cluster-scoped init script, the request to access the local workspace file system is routed through the proxy server.\nDue to the proxy configuration, the server blocks these local requests, causing the cluster startup to fail.\nSolution\nAdd the following line to your proxy init script. This configuration bypasses the proxy for local addresses (\nlocalhost\nand\n127.0.0.1\n), allowing the cluster to access the workspace file system directly.\nexport no_proxy=localhost,127.0.0.1\nRestart the cluster to apply the changes." 
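Putting the proxy fix together, the init script ends up looking something like the sketch below: the script text is generated in Python and would then be stored as a cluster-scoped init script. The proxy host and port are hypothetical; the `no_proxy` line is the one from the article.

```python
# Hypothetical corporate proxy -- replace with your own.
PROXY = "http://proxy.example.com:3128"

init_script = f"""#!/bin/bash
# Route outbound traffic through the corporate proxy...
export http_proxy={PROXY}
export https_proxy={PROXY}
# ...but bypass it for local addresses so the cluster can still
# reach the workspace file system (the fix from this article).
export no_proxy=localhost,127.0.0.1
"""

# In a notebook you would then store this text as a cluster-scoped
# init script, e.g. via dbutils or the Workspace API (not shown here).
print(init_script)
```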
+} \ No newline at end of file diff --git a/scraped_kb_articles/cluster-terminated-driver-down.json b/scraped_kb_articles/cluster-terminated-driver-down.json new file mode 100644 index 0000000000000000000000000000000000000000..44633c66131158f619b26d1777dfacab7c6c175c --- /dev/null +++ b/scraped_kb_articles/cluster-terminated-driver-down.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/cluster-terminated-driver-down", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou try to start a cluster, but it fails to start. You get an Apache Spark error message.\nInternal error message: Spark error: Driver down\nYou review the\ncluster driver and worker logs\nand see an error message containing\njava.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist\n.\n21/07/14 21:44:06 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.\r\njava.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist\r\n   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)\r\n   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)\r\n   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)\r\n   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)\r\n   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1668)\r\n   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1632)\r\n   at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)\r\n   at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:511)\r\n   at scala.collection.immutable.List.foreach(List.scala:392)\nCause\nYou have\nspark.files dummy\nset in your\nSpark Config\n, but no such file exists.\nSpark interprets the\ndummy\nconfiguration value as a valid file path and tries to find it on the local file system. 
If the file does not exist, it generates the error message.\njava.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist\nSolution\nOption 1:\nDelete\nspark.files dummy\nfrom your\nSpark Config\nif you are not passing actual files to Spark.\nOption 2:\nCreate a dummy file and place it on the cluster. You can do this with an init script.\nCreate the init script.\n%python\r\ndbutils.fs.put(\"dbfs:/databricks//create_dummy_file.sh\",\r\n\"\"\"\r\n#!/bin/bash\r\ntouch /databricks/driver/dummy\"\"\", True)\nInstall the init script that you just created as a\ncluster-scoped init script\n.\nYou will need the full path to the location of the script (\ndbfs:/databricks//create_dummy_file.sh\n).\nRestart the cluster\nRestart your cluster after you have installed the init script." +} \ No newline at end of file diff --git a/scraped_kb_articles/clusters-using-docker-databricksruntimelatest-tag-are-not-starting.json b/scraped_kb_articles/clusters-using-docker-databricksruntimelatest-tag-are-not-starting.json new file mode 100644 index 0000000000000000000000000000000000000000..151c81c96b5ca96a03cdc56fe5e16b8e7e2aa408 --- /dev/null +++ b/scraped_kb_articles/clusters-using-docker-databricksruntimelatest-tag-are-not-starting.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/clusters-using-docker-databricksruntimelatest-tag-are-not-starting", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Docker containers based off\ndatabricksruntime\nand using the latest tag, your clusters with Docker Container Services (DCS) in Databricks hang or fail to execute Python cells even though they were previously operational.\nusr/lib/python3.8/asyncio/base_events.py:1829: RuntimeWarning: coroutine 'DatabricksKernel.do_execute' was never awaited\r\nhandle = self._ready.popleft()\r\nRuntimeWarning: Enable tracemalloc to get the object allocation traceback\nCause\nUse of the\nlatest\ntag is no longer supported.\nSolution\nUpdate 
the Docker image to use a specific Databricks Runtime LTS version instead of the\nlatest\ntag. For example, use\ndatabricksruntime/standard:12.2-LTS\n.\nPreventative measures\nAvoid using the\nlatest\ntag for Docker images and always specify a supported runtime version.\nRegularly review Databricks release notes and updates to stay informed about changes that may impact your environment.\nFor further reading and setting up clusters using the image, refer to the\nCustomize containers with Databricks Container Service\n(\nAWS\n|\nAzure\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cmd-c-on-object-id-p0.json b/scraped_kb_articles/cmd-c-on-object-id-p0.json new file mode 100644 index 0000000000000000000000000000000000000000..d24b3ead195d8e6b1cbcc9c1a6e0938b71bff4bd --- /dev/null +++ b/scraped_kb_articles/cmd-c-on-object-id-p0.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/cmd-c-on-object-id-p0", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have imported Python libraries, but when you try to execute Python code in a notebook you get a repeating message as output.\nExample 1:\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nINFO:py4j.java_gateway:Received command c on object id p0\nExample 2:\nINFO:py4j.clientserver:Received command c on object id p0\nINFO:py4j.clientserver:Received command c on object id p0\nINFO:py4j.clientserver:Received command c on object id p0\nINFO:py4j.clientserver:Received command c on object id p0\nINFO:py4j.clientserver:Received command c on object id p0\nINFO:py4j.clientserver:Received command c on object id p0\nCause\nThe default log level 
for\npy4j.java_gateway/py4j.clientserver\nis\nERROR\n.\nIf any of the imported Python libraries set the log level to\nINFO\nyou will see this message.\nSolution\nYou can prevent the output of the\nINFO\nmessages by setting the log level back to\nERROR\nafter importing the libraries.\n%python\nimport logging\nlogger = spark._jvm.org.apache.log4j\nlogging.getLogger(\"py4j\").setLevel(logging.ERROR)" +} \ No newline at end of file diff --git a/scraped_kb_articles/column-drift-when-reading-multiple-delimited-files-.json b/scraped_kb_articles/column-drift-when-reading-multiple-delimited-files-.json new file mode 100644 index 0000000000000000000000000000000000000000..7b2e046f847f1621a061830b35d2acc594ca3545 --- /dev/null +++ b/scraped_kb_articles/column-drift-when-reading-multiple-delimited-files-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/column-drift-when-reading-multiple-delimited-files-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou notice column drift while reading multiple delimited files in a single\nspark.read\noperation. This problem manifests as columns being incorrectly mapped, leading to data integrity issues.\nExample\nspark.read.format(\"csv\").load(/*)\nWhere\nsource-directory\ncontains multiple CSV files.\nCause\nWhen multiple files with different schemas are read together, Databricks infers the schema from a sample of records.\nExample\nYou have 10 files in a source directory with 2 columns, column A and column B. Some files have 'column A' as the first column, while others have 'column B' as the first column in the schema.\nIf the schema is inferred from files with 'column A' as the first column, this will cause files with 'column B' as the first column to be mapped incorrectly. This issue is expected behavior when there is a schema difference among the source files.\nSolution\nEnsure that all files being processed together have the same schema. 
You can standardize the source files’ schema before processing.\nIf processing multiple files with different schemas is unavoidable, process each file individually to avoid schema inference issues.\nRegularly monitor and validate the schema of the source files to ensure consistency." +} \ No newline at end of file diff --git a/scraped_kb_articles/column-name-error-when-using-apache-spark-mlib-feature-transformers.json b/scraped_kb_articles/column-name-error-when-using-apache-spark-mlib-feature-transformers.json new file mode 100644 index 0000000000000000000000000000000000000000..57aff6a3f2ddb842e4f35367b87d54bc5ffb0196 --- /dev/null +++ b/scraped_kb_articles/column-name-error-when-using-apache-spark-mlib-feature-transformers.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/column-name-error-when-using-apache-spark-mlib-feature-transformers", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to use the Apache Spark MLlib\nStringIndexer\n, or other feature transformers with columns that are nested (meaning they have dots in their names, like\ncolumnname_part1.columnname_part2\n), you receive an error that the column name cannot be resolved.\norg.apache.spark.sql.AnalysisException: Cannot resolve column name \"columnname_part1.columnname_part2\" among (columnname_part1.columnname_part2, columnname_part1.columnname_part3); did you mean to quote the `columnname_part1.columnname_part2` column?\nAlternatively you receive an error that the column does not exist.\norg.apache.spark.SparkException: Input column `columnname_part1.columnname_part2` does not exist.\nCause\nSpark uses dots to identify nested structures in data (like\npart1\nand\npart2\nin\ncolumnname_part1.columnname_part2\n). When there’s a dot in the column name, Spark tries to treat the name as nested data.\nSolution\nRename the nested columns by replacing dots with underscores. 
Spark no longer sees the columns as nested, removing the need to use backticks.\ncolumnname_part1_columnname_part2\nExample in context\nval si = new StringIndexer().setInputCol(\"columnname_part1_columnname_part2\").setOutputCol(\"columnname_part1_indexed\")\r\nval pipeline = new Pipeline().setStages(Array(si))\r\npipeline.fit(flattenedDf).transform(flattenedDf).show()\nTo avoid this issue in the future, use underscores as a standard replacement for dots in column names across your data.\nImportant\nApproaches like wrapping column names in backticks or flattening nested data may help, but have disadvantages.\nWrapping column names in backticks makes the code harder to read and manage.\nFlattening the nested data by splitting it into separate columns with unique names creates a lot of new columns, making the data harder to work with and slowing down processing." +} \ No newline at end of file diff --git a/scraped_kb_articles/column-statistics-missing-when-running-analyze-table-table-compute-statistics-after-analyze-table-table-compute-statistics-for-all-columns.json b/scraped_kb_articles/column-statistics-missing-when-running-analyze-table-table-compute-statistics-after-analyze-table-table-compute-statistics-for-all-columns.json new file mode 100644 index 0000000000000000000000000000000000000000..d4205a4fcc7e10077350665794ef084808843579 --- /dev/null +++ b/scraped_kb_articles/column-statistics-missing-when-running-analyze-table-table-compute-statistics-after-analyze-table-table-compute-statistics-for-all-columns.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/column-statistics-missing-when-running-analyze-table-table-compute-statistics-after-analyze-table-table-compute-statistics-for-all-columns", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe statistics collected by the\nANALYZE TABLE
COMPUTE STATISTICS FOR ALL COLUMNS\ncommand are then not available when you subsequently run\nANALYZE TABLE\nCOMPUTE STATISTICS\n.\nYou may then notice suboptimal query performance because the query optimizer doesn’t have accurate statistics to generate efficient execution plans.\nCause\nThe\nANALYZE TABLE\nCOMPUTE STATISTICS\ncommand only computes the\nsizeInBytes\nand\nnumRows\nstatistics, not the column statistics.\nWhen you run\nANALYZE TABLE\nCOMPUTE STATISTICS\n, it overwrites the column statistics that\nANALYZE TABLE\nCOMPUTE STATISTICS FOR ALL COLUMNS\npreviously computed.\nAdditional context\nCertain usage patterns such as running\nANALYZE TABLE\nCOMPUTE STATISTICS\nmore frequently than\nANALYZE TABLE\nCOMPUTE STATISTICS FOR ALL COLUMNS\ncan exacerbate the issue, leading to the column statistics being frequently overwritten and not available for query optimization.\nSolution\nEnsure you only run\nANALYZE TABLE\nCOMPUTE STATISTICS FOR ALL COLUMNS\nto collect statistics for your tables.\nThis command computes both the\nsizeInBytes\nand\nnumRows\nstatistics, as well as the column statistics, so it overwrites the previously computed data with new, more inclusive data." +} \ No newline at end of file diff --git a/scraped_kb_articles/column-value-errors-when-connecting-from-apache-spark-to-databricks-using-spark-jdbc.json b/scraped_kb_articles/column-value-errors-when-connecting-from-apache-spark-to-databricks-using-spark-jdbc.json new file mode 100644 index 0000000000000000000000000000000000000000..ed8fb5e4633dbf93c5cc2a23b1c6a23a3e7276d9 --- /dev/null +++ b/scraped_kb_articles/column-value-errors-when-connecting-from-apache-spark-to-databricks-using-spark-jdbc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/column-value-errors-when-connecting-from-apache-spark-to-databricks-using-spark-jdbc", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen connecting Apache Spark to Databricks using Spark JDBC to read data from tables, you observe that column names are returned when you expect actual column values.\nAlternatively, you may receive an SQL exception where the column name is returned (string) when you expect an integer (actual column data type).\nExample of the SQL Exception\n[Databricks][JDBC](10140) Error converting value to int.\r\nat com.databricks.client.exceptions.ExceptionConverter.toSQLException(Unknown Source)\r\nat com.databricks.client.utilities.conversion.TypeConverter.toInt(Unknown Source)\r\nat com.databricks.client.jdbc.common.SForwardResultSet.getInt(Unknown Source)\r\nat org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$7(JdbcUtils.scala:412)\nCause\nApache Spark converts the query\nSELECT columnName FROM tableName\ninto\nSELECT \"columnName\" FROM (SELECT columnName FROM tableName)\n. 
This causes every row returned by the query to contain the hard-coded column name (a string literal) instead of the actual column value.\nSolution\nFor versions under 4.0.0, override the\nquoteIdentifier\nfunction in the\nJdbcDialect\nclass and register it under\nJDBCDialects\n.\nJava\nDefine a class that extends\nJdbcDialect\nand then register it before using\nspark.read.jdbc\nto read the table.\nimport org.apache.spark.sql.jdbc.JdbcDialects;\r\nimport org.apache.spark.sql.jdbc.JdbcType;\r\nimport org.apache.spark.sql.jdbc.JdbcDialect;\r\npublic class dbsqlDialectClass extends JdbcDialect {\r\n@Override\r\npublic boolean canHandle(String url) {\r\nreturn url.startsWith(\"jdbc:databricks\");\r\n}\r\n@Override\r\npublic String quoteIdentifier(String colName) {\r\nreturn \"`\"+colName+\"`\";\r\n}\r\n}\nRegister the dialect before using\nspark.read.jdbc\nto read the table, after the Spark session has been created.\ndbsqlDialectClass dbsqlDialect = new dbsqlDialectClass();\r\nJdbcDialects.registerDialect(dbsqlDialect);\r\nString url = \"jdbc:databricks://:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/;UseNativeQuery=1;UID=token;connCatalog=hive_metastore;PWD=\"+;\r\nString pushdown_query1 = \"select `` from \";\r\nDataset df = spark.read()\r\n.format(\"jdbc\")\r\n.option(\"url\", url)\r\n.option(\"driver\", \"com.databricks.client.jdbc.Driver\")\r\n.option(\"query\", pushdown_query1)\r\n.option(\"Auth_AccessToken\", )\r\n.load();\nPython\nUse the first part of the code to build a side jar.\nAdd the side jar to the class path so that the Java Virtual Machine (JVM) loads it when coming up.\nUse the\npy4j\nmodule to import the class into the below code snippet.\nfrom py4j.java_gateway import java_import\r\ngw = spark.sparkContext._gateway\r\njava_import(gw.jvm, \"dbsqlDialectClass\")\r\ngw.jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(\r\ngw.jvm.dbsqlDialectClass())\r\nurl = \"jdbc:databricks://:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/;UID=token;PWD=\" 
+ \r\npushdown_query = \"(select `` from ) as query\"\r\nloadedDF = spark.read.format(\"jdbc\").option(\"url\", url).option(\"driver\", \"com.databricks.client.jdbc.Driver\").option(\"dbtable\", pushdown_query).load()\r\ndisplay(loadedDF)\nNote\nThe\nissue has been fixed in Spark version 4.0.0\n. However, 4.0.0 is still in preview, so it is not used for production loads as of this article’s publish date." +} \ No newline at end of file diff --git a/scraped_kb_articles/column-values-assigning-in-the-order-they-are-passed-into-row-as-arguments-not-to-the-column-name-indicated.json b/scraped_kb_articles/column-values-assigning-in-the-order-they-are-passed-into-row-as-arguments-not-to-the-column-name-indicated.json new file mode 100644 index 0000000000000000000000000000000000000000..7e91002d96b66620395774ddee7a3b3ce3214516 --- /dev/null +++ b/scraped_kb_articles/column-values-assigning-in-the-order-they-are-passed-into-row-as-arguments-not-to-the-column-name-indicated.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/column-values-assigning-in-the-order-they-are-passed-into-row-as-arguments-not-to-the-column-name-indicated", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen creating a DataFrame using\nRow()\n, you pass in arguments defining the first and second column names and values in any order. 
In the output, you notice the column values are assigned in the order they were passed in, not to the column you indicated.\nExample code\nfrom pyspark.sql import Row\r\n\r\nrow1 = Row(FirstColumn=1, SecondColumn=2)\r\nrow2 = Row(SecondColumn=3, FirstColumn=4)\r\n\r\ndf = spark.createDataFrame([row1, row2])\r\ndf.show()\nExpected output\nFirstColumn\nSecondColumn\n1\n2\n4\n3\nActual output\nFirstColumn\nSecondColumn\n1\n2\n3\n4\nCause\nWhen you create a DataFrame using\nRow()\nwith named arguments, it inherits from tuple rather than dict, so the values are stored positionally and are not mapped to the column names you specify.\nSolution\nCreate the DataFrame from a list of dictionaries, or use the\nrow.asDict()\nmethod.\nTo create the DataFrame from a list of dictionaries, adapt the following example code.\ndata = [\r\n    {\"FirstColumn\": 1, \"SecondColumn\": 2},\r\n    {\"SecondColumn\": 3, \"FirstColumn\": 4}\r\n]\r\n\r\ndf2 = spark.createDataFrame(data)\r\ndf2.show()\nAlternatively, to use the\nrow.asDict()\nmethod, adapt the following example code.\nrow1 = Row(FirstColumn=1, SecondColumn=2)\r\nrow2 = Row(SecondColumn=3, FirstColumn=4)\r\n\r\nrow1_dict = row1.asDict()\r\nrow2_dict = row2.asDict()\r\n\r\ndf = spark.createDataFrame([row1_dict, row2_dict])\r\ndf.show()" +} \ No newline at end of file diff --git a/scraped_kb_articles/command-%25run-not-working-on-delta-live-tables-dlt-pipelinesr.json b/scraped_kb_articles/command-%25run-not-working-on-delta-live-tables-dlt-pipelinesr.json new file mode 100644 index 0000000000000000000000000000000000000000..6d8e60914cf0b7eca82e68780e3f2d71f29a765d --- /dev/null +++ b/scraped_kb_articles/command-%25run-not-working-on-delta-live-tables-dlt-pipelinesr.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/command-%25run-not-working-on-delta-live-tables-dlt-pipelinesr", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to use the\n%run\ncommand to execute another notebook within a Delta Live Tables 
(DLT) pipeline you receive an error.\n`%run is not supported in DLT pipelines.`\nCause\nThe\n%run\ncommand is not supported in DLT pipelines.\nContext\nDLT pipelines are designed to manage dependencies and orchestrate data transformations in a declarative manner. The\n%run\ncommand is an imperative way to execute code. This difference can lead to issues with reproducibility and dependency management in a DLT pipeline.\nFurther, when using the\n%run\ncommand in a DLT pipeline, the pipeline does not have visibility into code executed in a referenced notebook. The lack of visibility results in unexpected behavior because the pipeline is unable to properly manage dependencies or track changes to the code.\nLast, DLT pipelines are optimized for performance and scalability, and the\n%run\ncommand can introduce overhead that negatively impacts the performance of the pipeline.\nSolution\nRefactor your code to remove the\n%run\ncommand and use DLT's built-in features for code reuse and dependency management instead.\n1. Identify the code in the referenced notebook that needs to be reused.\n2. Extract the reusable code into a separate notebook or function.\n3. In the DLT pipeline, import the reusable code using the\nimport\nstatement or call the function directly.\n4. Ensure that all dependencies are properly managed and declared in the DLT pipeline." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/comments-not-reflecting-on-unity-catalog-tables.json b/scraped_kb_articles/comments-not-reflecting-on-unity-catalog-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..f1ec65962a56a31c1c8d4d158c7d634d0a907775 --- /dev/null +++ b/scraped_kb_articles/comments-not-reflecting-on-unity-catalog-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/comments-not-reflecting-on-unity-catalog-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou add or alter comments to your Unity Catalog table from either the\nCatalog\nUI or a notebook and you see the command executes successfully. However, when you review the UI, you don’t see the comments in the table.\nCause\nThe\nspark.databricks.delta.catalog.update.enabled\nconfiguration is set to\nfalse\nin the cluster or SQL warehouse settings. When this setting is false, automatic updates to Unity Catalog metadata (such as comments) are disabled, and the UI doesn’t show changes made through any means.\nSolution\nFor All-Purpose Clusters\nGo to the cluster's Spark configuration settings.\nLocate the\nspark.databricks.delta.catalog.update.enabled\nproperty.\nChange the setting to\ntrue\n.\nRestart the cluster to apply the changes.\nFor SQL Warehouses\nNavigate to\nWorkspace Settings > Compute\n.\nUnder\nSQL Warehouses\n, select\nManage\n.\nIn the\nData Access Properties\nsection, find\nspark.databricks.delta.catalog.update.enabled\n.\nSet it to\ntrue\n.\nSave the changes and restart the SQL Warehouse.\nNote\nWhen\nspark.databricks.delta.catalog.update.enabled\nis set to\ntrue\n, the Delta catalog shows changes to Delta tables, such as schema updates or data modifications, in the\nCatalog\nUI without requiring manual action. It doesn’t change anything in the database." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/common-errors-adf.json b/scraped_kb_articles/common-errors-adf.json new file mode 100644 index 0000000000000000000000000000000000000000..010a4e40b56aacd050e2891a66b97482dc30823b --- /dev/null +++ b/scraped_kb_articles/common-errors-adf.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/common-errors-adf", + "title": "Título do Artigo Desconhecido", + "content": "Azure Data Factory is a managed service that lets you author data pipelines using Azure Databricks notebooks, JARs, and Python scripts. This article describes common issues and solutions.\nCluster could not be created\nWhen you create a data pipeline in Azure Data Factory that uses an Azure Databricks-related activity such as Notebook Activity, you can ask for a new cluster to be created. In Azure, cluster creation can fail for a variety of reasons:\nYour Azure subscription is limited in the number of virtual machines that can be provisioned.\nFailed to create cluster because of Azure quota\nindicates that the subscription you are using does not have enough quota to create the needed resources. For example, if you request 500 cores but your quota is 50 cores, the request will fail. Contact Azure Support to request a quota increase.\nAzure resource provider is currently under high load and requests are being throttled.\nThis error indicates that your Azure subscription or perhaps even the region is being throttled. Simply retrying the data pipeline may not help. Learn more about this issue at\nTroubleshooting API throttling errors\n.\nCould not launch cluster due to cloud provider failures\nindicates a generic failure to provision one or more virtual machines for the cluster. Wait and try again later.\nCluster ran into issues during data pipeline execution\nAzure Databricks includes a variety of mechanisms that increase the resilience of your Apache Spark cluster. 
That said, it cannot recover from every failure, leading to errors like this:\nConnection refused\nRPC timed out\nExchange times out after X seconds\nCluster became unreachable during run\nToo many execution contexts are open right now\nDriver was restarted during run\nContext ExecutionContextId is disconnected\nCould not reach driver of cluster for X seconds\nMost of the time, these errors do not indicate an issue with the underlying infrastructure of Azure. Instead, it is quite likely that the cluster has too many jobs running on it, which can overload the cluster and cause timeouts.\nAs a general rule, you should move heavier data pipelines to run on their own Azure Databricks clusters.\nIntegrating with Azure Monitor\nand observing execution metrics with\nGrafana\ncan provide insight into clusters that are getting overloaded.\nAzure Databricks service is experiencing high load\nYou may notice that certain data pipelines fail with errors like these:\nThe service at {API} is temporarily unavailable\nJobs is not fully initialized yet. Please retry later\nFailed or timeout processing HTTP request\nNo webapps are available to handle your request\nThese errors indicate that the Azure Databricks service is under heavy load. If this happens, try limiting the number of concurrent data pipelines that include an Azure Databricks activity. For example, if you are performing ETL with 1,000 tables from source to destination, instead of launching a data pipeline per table, either combine multiple tables in one data pipeline or stagger their execution so they don’t all trigger at once.\nInfo\nAzure Databricks will not allow you to create more than 1,000 Jobs in a 3,600 second window. If you try to do so with Azure Data Factory, your data pipeline will fail.\nThese errors can also show up if you poll the Databricks Jobs API for job run status too frequently (e.g. every 5 seconds). 
The remedy is to reduce the frequency of polling.\nLibrary installation timeout\nAzure Databricks includes robust support for installing third-party libraries. Unfortunately, you may see issues like this:\nFailed or timed out installing libraries\nThis happens because every time you start a cluster with a library attached, Azure Databricks downloads the library from the appropriate repository (such as PyPI). This operation can time out, causing your cluster to fail to start.\nThere is no simple solution for this problem, other than limiting the number of libraries you attach to clusters." +} \ No newline at end of file diff --git a/scraped_kb_articles/common-errors-in-notebooks.json b/scraped_kb_articles/common-errors-in-notebooks.json new file mode 100644 index 0000000000000000000000000000000000000000..367da0e092c512666b614469d4d8403e1c9a0f03 --- /dev/null +++ b/scraped_kb_articles/common-errors-in-notebooks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/common-errors-in-notebooks", + "title": "Título do Artigo Desconhecido", + "content": "There are some common issues that occur when using notebooks. This section outlines some of the frequently asked questions and best practices that you should follow.\nSpark job fails with java.lang.NoClassDefFoundError\nSometimes you may come across an error like:\n%scala\r\n\r\njava.lang.NoClassDefFoundError: Could not initialize class line.....$read$\nThis can occur with a Spark Scala 2.11 cluster and a Scala notebook, if you mix together a case class definition and Dataset/DataFrame operations in the same notebook cell, and later use the case class in a Spark job in a different cell. 
For example, in the first cell, say you define a case class\nMyClass\nand also create a Dataset.\n%scala\r\n\r\ncase class MyClass(value: Int)\r\n\r\nval dataset = spark.createDataset(Seq(1))\nThen in a later cell, you create instances of\nMyClass\ninside a Spark job.\n%scala\r\n\r\ndataset.map { i => MyClass(i) }.count()\nSolution\nMove the case class definition to a cell of its own.\n%scala\r\n\r\ncase class MyClass(value: Int)   // no other code in this cell\n%scala\r\n\r\nval dataset = spark.createDataset(Seq(1))\r\ndataset.map { i => MyClass(i) }.count()\nSpark job fails with java.lang.UnsupportedOperationException\nSometimes you may come across an error like:\njava.lang.UnsupportedOperationException: Accumulator must be registered before send to executor\nThis can occur with a Spark Scala 2.10 cluster and a Scala notebook. The reason and solution for this error are the same as the prior\nSpark job fails with java.lang.NoClassDefFoundError\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/compare-versions-delta-table.json b/scraped_kb_articles/compare-versions-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..d8b386fc8b91e838922eba905083a0a5f59413bb --- /dev/null +++ b/scraped_kb_articles/compare-versions-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/compare-versions-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table.\nOne common use case is to compare two versions of a Delta table in order to identify what changed.\nFor more details on time travel, please review the Delta Lake time travel documentation (\nAWS\n|\nAzure\n|\nGCP\n).\nIdentify all differences\nYou can use a SQL\nSELECT\nquery to identify all differences between two versions of a Delta table.\nYou need to know the name of the table and the version numbers of the snapshots you want to 
compare.\n%sql\r\n\r\nselect * from @v\r\nexcept all\r\nselect * from\r\n@v\nFor example, if you had a table named “schedule” and you wanted to compare version 2 with the original version, your query would look like this:\n%sql\r\n\r\nselect * from schedule@v2\r\nexcept all\r\nselect * from\r\nschedule@v0\nIdentify files added to a specific version\nYou can use a Scala query to retrieve a list of files that were added to a specific version of the Delta table.\n%scala\r\n\r\ndisplay(spark.read.json(\"dbfs://_delta_log/00000000000000000002.json\").where(\"add is not null\").select(\"add.path\"))\nIn this example, we are getting a list of all files that were added to version 2 of the Delta table.\n00000000000000000002.json\ncontains the list of all files in version 2.\nAfter reading in the full list, we are excluding files that already existed, so the displayed list only includes files added to version 2." +} \ No newline at end of file diff --git a/scraped_kb_articles/concurrent-execution-of-create-or-replace-function-statements-leads-to-intermittent-routine_not_found-errors.json b/scraped_kb_articles/concurrent-execution-of-create-or-replace-function-statements-leads-to-intermittent-routine_not_found-errors.json new file mode 100644 index 0000000000000000000000000000000000000000..33b1aff6f073b547ce3e80736e51bec111815961 --- /dev/null +++ b/scraped_kb_articles/concurrent-execution-of-create-or-replace-function-statements-leads-to-intermittent-routine_not_found-errors.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/concurrent-execution-of-create-or-replace-function-statements-leads-to-intermittent-routine_not_found-errors", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou're running a job orchestrated by\ndbt-databricks\n, which invokes a Python User-Defined Function (UDF). 
The job fails consistently whenever executed, particularly in scenarios where multiple UDFs are being created or replaced concurrently.\nYou receive the following error.\n[ROUTINE_NOT_FOUND] The routine ``.``.`` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema and catalog, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP ... IF EXISTS. SQLSTATE: 42883\nCause\nThe\nCREATE OR REPLACE FUNCTION\nstatements are concurrently executing.\nThese operations are not atomic – they involve an implicit drop of the existing function followed by the creation of the new one. Delta Lake does not provide ACID guarantees for concurrent DDL statements of this type.\nWhen multiple\nCREATE OR REPLACE FUNCTION\ncommands for the same function are executed simultaneously, one execution drops the function while another is still referencing or re-creating it. This creates a transient state where the function is neither available nor consistently visible to the querying engine, leading to the\nROUTINE_NOT_FOUND\nerror.\nThis issue commonly arises in parallelized pipelines or orchestrators like dbt, where concurrent runs attempt to redefine the same function.\nSolution\nAvoid recreating UDFs frequently, implement retry logic, or use explicit\nDROP\nand\nCREATE\nfor UDFs.\nAvoid recreating UDFs frequently\nInstead of redefining functions every run, define the UDFs once during environment setup and reuse them. This approach is both safer and more efficient.\nImplement retry logic\nIf you can’t avoid concurrency, incorporate retry logic with exponential backoff into your UDF calls. This can help mitigate transient errors due to visibility issues in the system catalog. You can use the following example code. 
The retries are set to\n3\nand the backoff factor is set to\n0.5\n.\nimport time\r\nfrom databricks import sql\r\n\r\ndef execute_query_with_retry(query, max_retries=3, backoff_factor=0.5):\r\n    retry_count = 0\r\n    while retry_count < max_retries:\r\n        try:\r\n            with sql.connect(server_hostname=\"\", http_path=\"\", access_token=\"\") as connection:\r\n                with connection.cursor() as cursor:\r\n                    cursor.execute(query)\r\n                    return cursor.fetchall()\r\n        except sql.exc.Error as e:\r\n            if \"ROUTINE_NOT_FOUND\" in str(e):\r\n                retry_count += 1\r\n                time.sleep(backoff_factor * (2 ** retry_count))\r\n            else:\r\n                raise e\r\n    raise Exception(f\"Failed after {max_retries} retries\")\nUse explicit DROP and CREATE for UDFs\nInstead of relying on\nCREATE OR REPLACE FUNCTION\n, use an explicit\nDROP FUNCTION IF EXISTS\nfollowed by\nCREATE FUNCTION\nto avoid transient states that can lead to visibility issues during concurrent execution. You can use the following example code. 
Be sure to provide your own Python logic.\nDROP FUNCTION IF EXISTS ..;\r\n\r\nCREATE FUNCTION ..()\r\nRETURNS \r\nLANGUAGE PYTHON\r\nDETERMINISTIC\r\nAS $$\r\n# Your Python logic here\r\ndef transform(value):\r\n    # Example transformation\r\n    return some_transformation(value) if value else None\r\nreturn transform()\r\n$$;" +} \ No newline at end of file diff --git a/scraped_kb_articles/concurrent_query-error-on-auto-loader-job.json b/scraped_kb_articles/concurrent_query-error-on-auto-loader-job.json new file mode 100644 index 0000000000000000000000000000000000000000..006245749139a0f50b9ca36cc25a949ff6c902d5 --- /dev/null +++ b/scraped_kb_articles/concurrent_query-error-on-auto-loader-job.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/concurrent_query-error-on-auto-loader-job", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile running an Auto Loader job based on file triggering, which watches a storage location, you encounter a concurrent query error.\norg.apache.spark.SparkConcurrentModificationException: [CONCURRENT_QUERY] Another instance of this query [id: ] was just started by a concurrent session [existing runId: new runId: ].\nCause\nMultiple instances of the same streaming query are running concurrently on the same cluster.\nWhen an Auto Loader job is configured to run in\navailable now\nmode, it triggers a new job instance for every new file in the storage location.\nWhenever a new file arrives while the previous job instance is still running, it causes a conflict, resulting in a\n[CONCURRENT_QUERY]\nerror.\nSolution\nConfigure the Auto Loader job to run in\ncontinuous\nmode instead of\navailable now\nmode. 
This ensures that the job processes new files as they arrive, without triggering a new job instance for each file.\nIf your specific situation only allows the job to run when new files arrive, you have to implement backpressure handling mechanisms in your streaming job, such as rate limiting, windowing, or batching, to ensure that the job can handle the rate of data ingestion." +} \ No newline at end of file diff --git a/scraped_kb_articles/conda-fails-to-download-packages-from-anaconda.json b/scraped_kb_articles/conda-fails-to-download-packages-from-anaconda.json new file mode 100644 index 0000000000000000000000000000000000000000..ccbbfe4fbfc7064423d11094a093b26ebece358d --- /dev/null +++ b/scraped_kb_articles/conda-fails-to-download-packages-from-anaconda.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/conda-fails-to-download-packages-from-anaconda", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to download packages from the Anaconda repository and get a\nPackagesNotFoundError\nerror message.\nThis error can occur when using\n%conda\n, or\n%sh conda\nin notebooks, and when using Conda in an init script.\nCause\nAnaconda Inc. updated the\nterms of service\nfor\nrepo.anaconda.com\nand\nanaconda.org/anaconda\n. Based on the Anaconda terms of service, you may require a commercial license if you rely on Anaconda’s packaging and distribution. You should review the\nAnaconda Commercial Edition FAQ\nfor more information.\nInfo\nYour use of any Anaconda channels is governed by the Anaconda terms of service.\nAs a result, the default channel configuration for the Conda package manager was removed in Databricks Runtime 7.3 LTS for Machine Learning and above.\nSolution\nYou should review the Anaconda terms of service and determine if you require a commercial license.\nOnce you have verified that you have a valid license, you must specify a channel to install or update packages with Conda.
You can specify a Conda channel with\n-c \n.\nFor example,\n%conda install matplotlib\nreturns an error, while\n%conda install -c defaults matplotlib\ninstalls\nmatplotlib\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/conf-overwrites-default-settings.json b/scraped_kb_articles/conf-overwrites-default-settings.json new file mode 100644 index 0000000000000000000000000000000000000000..185ac69ce453a6b51b5c0e32f41c3373eec9dcd8 --- /dev/null +++ b/scraped_kb_articles/conf-overwrites-default-settings.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/conf-overwrites-default-settings", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you add a configuration setting by entering it in the Apache\nSpark config\ntext area, the new setting replaces existing settings instead of being appended.\nVersion\nDatabricks Runtime 5.1 and below.\nCause\nWhen the cluster restarts, the cluster reads settings from a configuration file that is created in the\nClusters\nUI, and overwrites the default settings.\nFor example, when you add the following\nextraJavaOptions\nto the\nSpark config\ntext area:\nspark.executor.extraJavaOptions -javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml\nThen, in\nSpark UI\n>\nEnvironment\n>\nSpark Properties\nunder\nspark.executor.extraJavaOptions\n, only the newly added configuration setting shows:\n-javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml\nAny existing settings are removed.\nFor reference, the default settings are:\n-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -Ddatabricks.serviceName=spark-executor-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl\nSolution\nTo add a new configuration setting to\nspark.executor.extraJavaOptions\nwithout losing the default settings:\nIn\nSpark UI\n>\nEnvironment\n>\nSpark Properties\n, select and copy all of the properties set by default for\nspark.executor.extraJavaOptions\n.\nClick\nEdit\n.\nIn the\nSpark config\ntext area (\nClusters\n>\ncluster-name\n>\nAdvanced Options\n>\nSpark\n), paste the default settings.\nAppend the new configuration setting below the default settings.\nClick outside the text area, then click\nConfirm\n.\nRestart the cluster.\nFor example, let’s say you paste the following settings into the\nSpark config\ntext area.
The new configuration setting is appended to the default settings.\nspark.executor.extraJavaOptions = -Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing -Ddatabricks.serviceName=spark-executor-1 -Djava.security.properties=/databricks/spark/dbconf/java/extra.security -XX:+PrintFlagsFinal -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Xss4m -Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Djavax.xml.validation.SchemaFactory:https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory -Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser -Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl -javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml\nAfter you restart the cluster, the default settings and newly added configuration setting appear in\nSpark UI\n>\nEnvironment\n>\nSpark Properties\n.
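The procedure above boils down to one rule: the value you paste must be the copied defaults followed by the new flag. A minimal pure-Python sketch of that append step (the defaults string here is abbreviated for illustration; copy the real one from Spark UI > Environment > Spark Properties on your cluster):

```python
# Abbreviated placeholder for illustration only, not the full default list.
defaults = "-Djava.io.tmpdir=/local_disk0/tmp -XX:ReservedCodeCacheSize=256m"
new_flag = ("-javaagent:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.jar"
            "=9404:/opt/prometheus_jmx_exporter/jmx_prometheus_javaagent.yml")

# The pasted value is the defaults first, then the new setting appended.
merged = f"spark.executor.extraJavaOptions = {defaults} {new_flag}"
print(merged)
```

Pasting only `new_flag` is exactly the mistake the article describes: the text area replaces, rather than extends, the existing value.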
+} \ No newline at end of file diff --git a/scraped_kb_articles/configure-simba-azure-ad-creds.json b/scraped_kb_articles/configure-simba-azure-ad-creds.json new file mode 100644 index 0000000000000000000000000000000000000000..cfa1324f7c5cd2377fcdf9e6beafc3bf45c39672 --- /dev/null +++ b/scraped_kb_articles/configure-simba-azure-ad-creds.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/configure-simba-azure-ad-creds", + "title": "Título do Artigo Desconhecido", + "content": "This article describes how to access Azure Databricks with a Simba JDBC driver using Azure AD authentication.\nThis can be useful if you want to use an Azure AD user account to connect to Azure Databricks.\nInfo\nPower BI has native support for Azure AD authentication with Azure Databricks. Review the\nPower BI\ndocumentation for more information.\nCreate a service principal\nCreate a service principal in Azure AD. The service principal obtains an access token for the user.\nOpen the\nAzure Portal\n.\nOpen the\nAzure Active Directory\nservice.\nClick\nApp registrations\nin the left menu.\nClick\nNew registration\n.\nComplete the form and click\nRegister\n.\nYour service principal has been successfully created.\nConfigure service principal permissions\nOpen the service principal you created.\nClick\nAPI permissions\nin the left menu.\nClick\nAdd a permission\n.\nClick\nAzure Rights Management Services\n.\nClick\nDelegated permissions\n.\nSelect\nuser_impersonation\n.\nClick\nAdd permissions\n.\nThe\nuser_impersonation\npermission is now assigned to your service principal.\nInfo\nIf\nGrant admin consent\nis not enabled, you may encounter an error later on in the process.\nUpdate service principal manifest\nClick\nManifest\nin the left menu.\nLook for the line containing the\n\"allowPublicClient\"\nproperty.\nSet the value to\ntrue\n.\nClick\nSave\n.\nDownload and configure the JDBC driver\nDownload the\nDatabricks JDBC Driver\n.\nConfigure the JDBC driver\nas
detailed in the documentation.\nObtain the Azure AD token\nUse the sample code to obtain the Azure AD token for the user.\nReplace the variables with values that are appropriate for your account.\n%python\r\n\r\nfrom adal import AuthenticationContext\r\n\r\nauthority_host_url = \"https://login.microsoftonline.com/\"\r\n# Application ID of Azure Databricks\r\nazure_databricks_resource_id = \"2ff814a6-3304-4ab8-85cb-cd0e6f879c1d\"\r\n\r\n# Required user input\r\nuser_parameters = {\r\n   \"tenant\" : \"\",\r\n   \"client_id\" : \"\",\r\n   \"username\" : \"\",\r\n   \"password\" : \r\n}\r\n\r\n# configure AuthenticationContext\r\n# authority URL and tenant ID are used\r\nauthority_url = authority_host_url + user_parameters['tenant']\r\ncontext = AuthenticationContext(authority_url)\r\n\r\n# API call to get the token\r\ntoken_response = context.acquire_token_with_username_password(\r\n  azure_databricks_resource_id,\r\n  user_parameters['username'],\r\n  user_parameters['password'],\r\n  user_parameters['client_id']\r\n)\r\n\r\naccess_token = token_response['accessToken']\r\nrefresh_token = token_response['refreshToken']\nPass the Azure AD token to the JDBC driver\nNow that you have the user’s Azure AD token, you can pass it to the JDBC driver using\nAuth_AccessToken\nin the JDBC URL as detailed in the\nBuilding the connection URL for the Databricks driver\ndocumentation.\nThis sample code demonstrates how to pass the Azure AD token.\n%python\r\n\r\n# Install jaydebeapi pypi module (used for demo)\r\n\r\nimport jaydebeapi\r\nimport pandas as pd\r\n\r\nimport os\r\nos.environ[\"CLASSPATH\"] = \"\"\r\n\r\n# JDBC connection string\r\nurl=\"jdbc:spark://adb-111111111111xxxxx.xx.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o//;AuthMech=11;Auth_Flow=0;Auth_AccessToken={0}\".format(access_token)\r\n\r\n# Initialize cursor so the finally block is safe if connect() fails\r\ncursor = None\r\ntry:\r\n  conn=jaydebeapi.connect(\"com.simba.spark.jdbc.Driver\", url)\r\n  cursor = conn.cursor()\r\n\r\n  # Execute SQL query\r\n  sql=\"select * from \"\r\n  cursor.execute(sql)\r\n  results = cursor.fetchall()\r\n  column_names = [x[0] for x in cursor.description]\r\n  pdf = pd.DataFrame(results, columns=column_names)\r\n  print(pdf.head())\r\n\r\n  # Uncomment the following two lines if this code is running in the Databricks Connect IDE or within a workspace notebook.\r\n  # df = spark.createDataFrame(pdf)\r\n  # df.show()\r\n\r\nfinally:\r\n  if cursor is not None:\r\n    cursor.close()" +} \ No newline at end of file diff --git a/scraped_kb_articles/configure-simba-proxy-windows.json b/scraped_kb_articles/configure-simba-proxy-windows.json new file mode 100644 index 0000000000000000000000000000000000000000..92a4b919f91cddec08b5d5ade0f1d28dce939159 --- /dev/null +++ b/scraped_kb_articles/configure-simba-proxy-windows.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/configure-simba-proxy-windows", + "title": "Título do Artigo Desconhecido", + "content": "In this article you learn how to configure the\nDatabricks ODBC Driver\nwhen your local Windows machine is behind a proxy server.\nDownload the Simba driver for Windows\nDownload and install the latest version of the\nDatabricks ODBC Driver\nfor Windows.\nAdd proxy settings to the Windows registry\nOpen the Windows registry and add the proxy settings to the Simba Spark ODBC Driver key.\nOpen the Windows Registry Editor.\nNavigate to the\nHKEY_LOCAL_MACHINE\\SOFTWARE\\Simba\\Simba Spark ODBC Driver\\Driver\nkey.\nClick Edit.\nSelect New.\nClick String Value.\nEnter UseProxy as the Name and 1 as the Data value.\nRepeat this until you have added the following string value pairs:\nName\nProxyHost\nData\n\nName\nProxyPort\nData\n\nName\nProxyUID\nData\n\nName\nProxyPWD\nData\n\nClose the registry editor.\nConfigure settings in ODBC Data Source Administrator\nOpen the\nODBC Data Sources\napplication.\nClick the\nSystem DSN\ntab.\nSelect the\nSimba Spark ODBC Driver\nand click\nConfigure\n.\nEnter the connection
information of your Apache Spark server.\nClick\nAdvanced Options\n.\nEnable the\nDriver Config Take Precedence\ncheck box.\nClick\nOK\n.\nClick\nOK\n.\nClick\nOK\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/conflicting-directory-structures-error.json b/scraped_kb_articles/conflicting-directory-structures-error.json new file mode 100644 index 0000000000000000000000000000000000000000..e33287e9029b558c0b0f7ddd2a6f697837daaf34 --- /dev/null +++ b/scraped_kb_articles/conflicting-directory-structures-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/conflicting-directory-structures-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an Apache Spark job that is failing with a Java assertion error\njava.lang.AssertionError: assertion failed: Conflicting directory structures detected.\nExample stack trace\nCaused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. 
Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')\r\n=== Streaming Query ===\r\nIdentifier: [id = aabc5549-cb4b-4e4e-9403-4e793f4824a0, runId = 4e743dda-909f-4932-9489-3dd0b364d811]\r\nCurrent Committed Offsets: {}\r\nCurrent Available Offsets: {CloudFilesSource[://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]: {'seqNum':423,'sourceVersion':1}}\r\n\r\nCurrent State: ACTIVE\r\nThread State: RUNNABLE\r\n\r\nLogical Plan:\r\nCloudFilesSource[://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]\r\nat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:385)\r\nat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:268)\r\nCaused by: java.lang.RuntimeException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')\r\nat com.databricks.sql.fileNotification.autoIngest.CloudFilesErrors$.partitionInferenceError(CloudFilesErrors.scala:115)\r\nat com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.liftedTree1$1(CloudFilesSourceFileIndex.scala:65)\r\nat com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.partitionSpec(CloudFilesSourceFileIndex.scala:63)\r\nat org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)\r\nat com.databricks.sql.fileNotification.autoIngest.CloudFilesSource.getBatch(CloudFilesSource.scala:361)\r\n... 1 more\r\nCaused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. 
Suspicious paths:\r\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt\r\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt\r\n\r\nIf provided paths are partition directories, please set 'basePath' in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.\r\nat scala.Predef$.assert(Predef.scala:223)\r\nat org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:204)\r\nat org.apache.spark.sql.execution.datasources.PartitioningUtils$.parseP\nCause\nYou have conflicting directory paths in the storage location.\nIn the example stack trace, we see two conflicting directory paths.\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt\nBecause these directories appear in the same hierarchy, an update in root or in a branch level can result in a conflict.\nSolution\nAvoid multiple concurrent updates in a hierarchical directory structure or updates happening in the same partition.\nYou should make multiple distinct paths for updates once a conflict is detected. 
Alternatively, you can add more partitions.\nThese example directories do not conflict.\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt1\n://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt2" +} \ No newline at end of file diff --git a/scraped_kb_articles/connection-pool-is-full-error-when-pulling-models-from-s3-with-mlflow.json b/scraped_kb_articles/connection-pool-is-full-error-when-pulling-models-from-s3-with-mlflow.json new file mode 100644 index 0000000000000000000000000000000000000000..028f6d00306890c44eefab547cbbb82303b1d63f --- /dev/null +++ b/scraped_kb_articles/connection-pool-is-full-error-when-pulling-models-from-s3-with-mlflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/connection-pool-is-full-error-when-pulling-models-from-s3-with-mlflow", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are pulling models from S3 using MLflow when you get a\nConnection pool is full\nerror message.\nWARNING:urllib3.connectionpool:Connection pool is full, discarding connection: XXXXXXXX.XXX.XXXXXXXX.com. Connection pool size: 10\nCause\nThe warning message is generated when the connection pool in\nurllib3\nreaches its maximum size. The default maximum size of the connection pool is\n10\n, which may not be sufficient for some use cases.\nSolution\nThe warning message is benign and does not result in data loss. To avoid seeing these warning messages, you can increase the maximum size of the connection pool by setting the\nMLFLOW_HTTP_POOL_CONNECTIONS\nand\nMLFLOW_HTTP_POOL_MAXSIZE\nenvironment variables to a value larger than\n10\n, such as\n30\n.\nFor more information, review the\nmlflow.environment_variables\ndocumentation." 
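The environment variables mentioned above can also be set from Python, as long as it happens before MLflow sends its first request. A minimal sketch (the value 30 is just the example from the article; tune it to your workload):

```python
import os

# Raise the urllib3 connection pool limits used by MLflow's HTTP clients.
# These must be set before the first MLflow request is made.
os.environ["MLFLOW_HTTP_POOL_CONNECTIONS"] = "30"
os.environ["MLFLOW_HTTP_POOL_MAXSIZE"] = "30"

# import mlflow  # import and use MLflow only after the variables are set
```

In Databricks, setting these as cluster-level environment variables achieves the same effect without notebook code.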
+} \ No newline at end of file diff --git a/scraped_kb_articles/connection-refused-error-when-trying-to-connect-to-an-external-service-from-databricks.json b/scraped_kb_articles/connection-refused-error-when-trying-to-connect-to-an-external-service-from-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..227fafa2c1742d15522b55eb7b52e31844c94d7a --- /dev/null +++ b/scraped_kb_articles/connection-refused-error-when-trying-to-connect-to-an-external-service-from-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/connection-refused-error-when-trying-to-connect-to-an-external-service-from-databricks", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen connecting Databricks to external services, such as an SQL server, Azure storage accounts, or Amazon RDS instances, you receive a\nConnection Refused\nerror.\nCause\nYou may have a network misconfiguration, firewall restriction, or authentication error.\nWith network misconfigurations, you may have incorrect subnets, route tables, or DNS settings that block communication with external services.\nIf your network firewall doesn’t include the external service on its access list, any attempt to connect may be refused.\nSolution\nCheck your DNS resolution and port connectivity. If either or both of these checks fail, it confirms an issue related to any of the causes mentioned in the previous section, and requires communicating with your internal network team first.\nFirst, identify the service you are trying to connect to and get the hostname.\nNote\nIn a given URL,\nhttps://www.domainname.com\nthe hostname is the\ndomainname.com\npart. 
You must pass in only the hostname.\nCheck DNS resolution\nRun the following command in a Databricks notebook using the same cluster specifications as when you faced the issue.\n%sh dig +short \nIf the output is blank, DNS is not resolving the hostname, which can cause connection failures.\nIf you see an IP address, confirm that it matches the specific, expected IP address for the external service you are trying to connect to.\nReport connection failures or a mismatching IP address to your internal networking team.\nCheck port connectivity\nFind the port number for the service you want to connect to and run the following command. Different services work on different port numbers. For example, the default port for the Azure SQL server is 1433.\n%sh nc -vz \nIf you receive the response\nConnection to port (tcp) failed: Operation timed out\n, report this to your internal networking team.\nIf you have verified with your internal networking team that DNS resolution and port connectivity are working, and still experience an issue, contact Databricks support. Include all command output screenshots when filing a ticket for more efficient issue resolution." +} \ No newline at end of file diff --git a/scraped_kb_articles/connection-retries-take-a-long-time-to-fail.json b/scraped_kb_articles/connection-retries-take-a-long-time-to-fail.json new file mode 100644 index 0000000000000000000000000000000000000000..386096dea9f7c180e67169a4e7e61c28e550bad7 --- /dev/null +++ b/scraped_kb_articles/connection-retries-take-a-long-time-to-fail.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/connection-retries-take-a-long-time-to-fail", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to access a table on a remote HDFS location or an object store that you do not have permission to access. The\nSELECT\ncommand should fail, and it does, but it does not fail quickly.
It can take up to ten minutes, sometimes more, to return a\nConnectTimeoutException\nerror message.\nThe error message you eventually receive is: \"\r\nError in SQL statement: ConnectTimeoutException: Call From 1006-163012-faded894-10-133-241-86/127.0.1.1 to\nanalytics.aws.healthverity.com\n:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=\nanalytics.aws.healthverity.com/10.24.12.199:8020\n]; For more details see: SocketTimeout - HADOOP2 - Apache Software Foundation \"\nCause\nEverything is working as designed; however, the default Apache Hadoop values for connection timeout and retry are high, which is why the connection does not fail quickly.\nipc.client.connect.timeout 20000\r\nipc.client.connect.max.retries.on.timeouts 45\nReview the\ncomplete list of Hadoop common core-default.xml values\n.\nReview the\nSocketTimeout\ndocumentation for more details.\nSolution\nYou can resolve the issue by reducing the values for connection timeout and retry.\nThe\nipc.client.connect.timeout\nvalue is in milliseconds.\nThe\nipc.client.connect.max.retries.on.timeouts\nvalue is the number of times to retry before failing.\nSet these values in your cluster's\nSpark config\n(\nAWS\n|\nAzure\n).\nIf you are not sure what values to use, these are Databricks recommended values:\nipc.client.connect.timeout 5000\r\nipc.client.connect.max.retries.on.timeouts 3" +} \ No newline at end of file diff --git a/scraped_kb_articles/connection-timeout-error-when-trying-to-connect-to-snowflake-from-databricks-environments.json b/scraped_kb_articles/connection-timeout-error-when-trying-to-connect-to-snowflake-from-databricks-environments.json new file mode 100644 index 0000000000000000000000000000000000000000..0a3e9e055b03d437853e34b68f4bb0d0a9f99b05 --- /dev/null +++
b/scraped_kb_articles/connection-timeout-error-when-trying-to-connect-to-snowflake-from-databricks-environments.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/connection-timeout-error-when-trying-to-connect-to-snowflake-from-databricks-environments", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to connect to Snowflake from Databricks environments, trying to read from or write to Snowflake tables, or execute Snowflake queries, you receive an error message.\nConnection to Snowflake timed out.\nCause\nConnectivity issues between Databricks and Snowflake can arise from various factors, including incorrect network configurations such as firewall rules and VPC settings, DNS resolution problems, misconfigured proxy settings, account identifier mismatches, high network latency, and SSL/TLS certificate validation issues.\nSolution\nTo effectively diagnose these challenges, Snowflake offers the Snowflake Connectivity Diagnostic (SnowCD) utility, which you can integrate into your Databricks environment for comprehensive network testing.\nGet SnowCD ready in Databricks\nFirst, create an init script to download and install SnowCD on your Databricks cluster.\nOpen a new text file in your workspace, and paste in the following code snippet.\n#!/bin/bash\r\nwget https://sfc-repo.snowflakecomputing.com/snowcd/linux_amd64/latest/snowcd -O /usr/local/bin/snowcd\r\nchmod +x /usr/local/bin/snowcd\nNext, save this script in your workspace files as\ninstall_snowcd.sh\n. 
Upload the script file directly to the desired location in your workspace, ensuring the path follows the format\n“/Workspace//install_snowcd.sh”\n.\nThen, configure your Databricks cluster to use the init script.\nGo to your cluster and click it to open your configuration UI.\nUnder\nAdvanced Options\n, select\nInit Scripts\n.\nAdd the path to your script:\n/Workspace//install_snowcd.sh\nSave your cluster settings.\nObtain Snowflake endpoints\nThis step is like getting a map of all the places Snowflake lives on the internet.\nFirst, connect to your Snowflake account using the web interface or SnowSQL.\nNext, execute the following SQL query.\nSELECT SYSTEM$ALLOWLIST();\nNote\nIf you're using a private link connection, use\nSYSTEM$ALLOWLIST_PRIVATELINK()\ninstead.\nLast, save the JSON output to a file named\nallowlist.json\nin your Databricks workspace.\nRun SnowCD in Databricks\nUse SnowCD to check the connection.\nFirst, create a new notebook in Databricks and use the following Python code to run SnowCD.\nimport subprocess\r\n\r\n# Path to the allowlist.json file in Databricks\r\nallowlist_path = \"/dbfs/path/to/your/allowlist.json\"\r\n\r\n# Run SnowCD\r\nresult = subprocess.run([\"snowcd\", allowlist_path], capture_output=True, text=True)\r\n\r\n# Print the output\r\nprint(result.stdout)\r\n\r\n# Check for errors\r\nif result.returncode != 0:\r\n    print(\"Error occurred:\")\r\n    print(result.stderr)\r\nelse:\r\n    print(\"All checks passed successfully\")\nAfter running the test, you'll see output in the result.
If all checks pass, you'll see\n\"All checks passed successfully\"\n.\nIf there are issues, SnowCD will provide detailed error messages about which endpoints couldn't be reached.\nTroubleshoot based on SnowCD output\nIf SnowCD reports DNS lookup failures, work with your internal network team to ensure proper DNS resolution for Snowflake endpoints.\nFor connection failures, review and update firewall rules to allow traffic to Snowflake IP addresses and ports identified in the\nallowlist.json\nfile.\nIf using a proxy, ensure it's correctly configured in your Databricks environment and can handle Snowflake connections.\nFor SSL/TLS-related errors, verify that your Databricks cluster supports the required TLS version and cipher suites for Snowflake.\nFor account identifier issues, ensure you're using the correct Snowflake account identifier in your connection strings.\nFor more information, refer to Snowflake’s\nSnowCD (Connectivity Diagnostic Tool)\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/connection-to-external-sources-after-configuring-proxy-settings-is-failing.json b/scraped_kb_articles/connection-to-external-sources-after-configuring-proxy-settings-is-failing.json new file mode 100644 index 0000000000000000000000000000000000000000..db58fcc58428109dc6873303d0f196ae04be7e59 --- /dev/null +++ b/scraped_kb_articles/connection-to-external-sources-after-configuring-proxy-settings-is-failing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/connection-to-external-sources-after-configuring-proxy-settings-is-failing", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou configure the following proxy settings in a notebook.\nHTTP_PROXY=http://10.24.xxx.xxx:443\r\nHTTPS_PROXY=https://10.24.xxx.xxx:443\nYou then encounter issues such as the following.\nError messages indicating failed connections to external repositories.\nFailure to download libraries or dependencies.\nInability to access 
certain websites or services due to proxy configuration issues.\nCause\nConfiguring proxy settings at the cluster level prevents the required fully qualified domain names (FQDNs) from being allowlisted in the firewall.\nYou may also, or instead, have unnecessary Apache Spark configuration properties or incorrect proxy settings. The specific cause depends on the technical details of your Databricks environment.\nSolution\nAllowlist the required FQDNs, remove unnecessary Spark configuration properties, then configure and test your proxy settings. Last, verify connectivity.\nAllowlist required FQDNs\nEnsure that the required FQDNs are allowlisted in your organization’s proxy server config to allow access to external repositories. Contact the team in your organization responsible for the proxy servers.\nRemove unnecessary properties\nReview your cluster Spark config properties and remove unnecessary ones. For details on how to modify Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nConfigure proxy settings\nConfigure the proxy settings at the cluster level. Add the following configs to your cluster settings.\nReplace\n*.example.com\nwith the hostname of the service excluded from using the proxy. For example,\n*.snowflakecomputing.com\nfor Snowflake services.\nspark.driver.extraJavaOptions=\"\r\n-Dhttp.proxyHost=10.24.132.XXX -Dhttp.proxyPort=80 -Dhttps.proxyHost=10.24.132.XXX -Dhttps.proxyPort=443 \r\n-Dhttp.nonProxyHosts=*.example.com -Dhttps.nonProxyHosts=*.example.com\"\r\n\r\nspark.executor.extraJavaOptions=\"\r\n-Dhttp.proxyHost=10.24.132.XXX -Dhttp.proxyPort=80 -Dhttps.proxyHost=10.24.132.XXX -Dhttps.proxyPort=443 -Dhttp.nonProxyHosts=*.example.com -Dhttps.nonProxyHosts=*.example.com\"\nTest proxy access\nDo a telnet test to check if your cluster has access to your proxy.
The following code provides an example.\ntelnet 10.24.132.XXX 443\nVerify connectivity\nAfter you’ve completed the above adjustments and checks, verify that connectivity to external repositories works correctly by running the following two netcat test commands and curl command.\nnetcat test directly to the target hostname\nnc -vz 443\nnetcat test with proxy\nnc -X connect -x : \nThe following code provides an example.\nnc -X connect -x 127.0.0.1:8080 google.com 443\ncurl test with proxy\ncurl --proxy http://: https://" +} \ No newline at end of file diff --git a/scraped_kb_articles/content-size-error-when-trying-to-import-or-export-a-notebook.json b/scraped_kb_articles/content-size-error-when-trying-to-import-or-export-a-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..dd40be46de12dbe44da06bc43e98e138639a7178 --- /dev/null +++ b/scraped_kb_articles/content-size-error-when-trying-to-import-or-export-a-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/content-size-error-when-trying-to-import-or-export-a-notebook", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to import or export a Databricks notebook when you get a content size error. This can happen when using the API, CLI, or Terraform provider.\ncontent size (xxxx) exceeded the limit 10485760.\nCause\nThere is a size limit of 10MB per notebook. Trying to import or export a notebook larger than 10MB generates an error.\nSolution\nReduce the size of your notebook so it is under 10MB.\nWays to reduce the notebook size include:\nClear cell outputs. This removes any results stored in the notebook and can quickly lower the size.\nSplit large notebooks into multiple smaller notebooks and use\n%run\nor other techniques to run those smaller notebooks in a larger file.
For more information, review the\nOrchestrate notebooks and modularize code in notebooks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nRemove unnecessary cells. If your notebook contains cells that are not essential, removing them can help reduce the size of your notebook.\nConsider limiting the number of rows returned by your queries. Add filters to your query to remove unnecessary records from the notebook.\nExport the notebook as source files such as Python (.py) or Scala (.scala).\nIf you cannot reduce the size of your notebook below 10MB, reach out to Databricks support for assistance. You will need to share the notebook URL, a description of why the file size cannot be trimmed, and the specific error message." +} \ No newline at end of file diff --git a/scraped_kb_articles/convert-datetime-to-string.json b/scraped_kb_articles/convert-datetime-to-string.json new file mode 100644 index 0000000000000000000000000000000000000000..4aafd922e319cb3fd1a6dc421609578ecf9da4cc --- /dev/null +++ b/scraped_kb_articles/convert-datetime-to-string.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/convert-datetime-to-string", + "title": "Unknown Article Title", + "content": "There are multiple ways to display date and time values with Python, however not all of them are easy to read.\nFor example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common YYYY-MM-DD HH:MM:SS format.\nIf you want to print the date and time, or maybe use it for timestamp validation, you can convert the datetime object to a string. 
This automatically converts the datetime object into a common time format.\nIn this article, we show you how to display the timestamp as a column value, before converting it to a datetime object, and finally, a string value.\nDisplay timestamp as a column value\nTo display the current timestamp as a column value, you should call\ncurrent_timestamp()\n.\nThis provides the date and time as of the moment it is called.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\ndisplay(spark.range(1).withColumn(\"date\",current_timestamp()).select(\"date\"))\nSample output:\nAssign timestamp to datetime object\nInstead of displaying the date and time in a column, you can assign it to a variable.\n%python\r\n\r\nmydate = spark.range(1).withColumn(\"date\",current_timestamp()).select(\"date\").collect()[0][0]\nOnce this assignment is made, you can call the variable to display the stored date and time value as a datetime object.\n%python\r\n\r\nmydate\nSample output:\ndatetime.datetime(2021, 6, 25, 11, 0, 56, 813000)\nInfo\nThe date and time is current as of the moment it is assigned to the variable as a datetime object, but the datetime object value is static unless a new value is assigned.\nConvert to string\nYou can convert the datetime object to a string by calling\nstr()\non the variable. Calling\nstr()\njust converts the datetime object to a string. 
It does not update the value with the current date and time.\n%python\r\n\r\nstr(mydate)\nSample output:\n'2021-06-25 11:00:56.813000'" +} \ No newline at end of file diff --git a/scraped_kb_articles/convert-flat-df-to-nested-json.json b/scraped_kb_articles/convert-flat-df-to-nested-json.json new file mode 100644 index 0000000000000000000000000000000000000000..097296938940b1fbb58457b00a6951e87e8cdf95 --- /dev/null +++ b/scraped_kb_articles/convert-flat-df-to-nested-json.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/convert-flat-df-to-nested-json", + "title": "Unknown Article Title", + "content": "This article explains how to convert a flattened DataFrame to a nested structure, by nesting a case class within another case class.\nYou can use this technique to build a JSON file that can then be sent to an external API.\nDefine nested schema\nWe’ll start with a flattened DataFrame.\nUsing this example DataFrame, we define a custom nested schema using case classes.\n%scala\r\n\r\ncase class empId(id:String)\r\ncase class depId(dep_id:String)\r\ncase class details(id:empId,name:String,position:String,depId:depId)\r\ncase class code(manager_id:String)\r\ncase class reporting(reporting:Array[code])\r\ncase class hireDate(hire_date:String)\r\ncase class emp_record(emp_details:details,incrementDate:String,commission:String,country:String,hireDate:hireDate,reports_to:reporting)\nYou can see that the case classes nest different data types within one another.\nConvert flattened DataFrame to a nested structure\nUse\nDF.map\nto pass every row object to the corresponding case class.\n%scala\r\n\r\nimport spark.implicits._\r\nval nestedDF= DF.map(r=>{\r\nval empID_1= empId(r.getString(0))\r\nval depId_1 = depId(r.getString(7))\r\nval details_1=details(empID_1,r.getString(1),r.getString(2),depId_1)\r\nval code_1=code(r.getString(3))\r\nval reporting_1 = reporting(Array(code_1))\r\nval hireDate_1 = 
hireDate(r.getString(4))\r\nemp_record(details_1,r.getString(8),r.getString(6),r.getString(9),hireDate_1,reporting_1)\r\n\r\n}\r\n)\nThis creates a nested DataFrame.\nWrite out nested DataFrame as a JSON file\nUse the\nrepartition().write.option\nfunction to write the nested DataFrame to a JSON file.\n%scala\r\n\r\nnestedDF.repartition(1).write.option(\"multiLine\",\"true\").json(\"dbfs:/tmp/test/json1/\")\nExample notebook\nReview the\nDataFrame to nested JSON example notebook\nto see each of these steps performed." +} \ No newline at end of file diff --git a/scraped_kb_articles/copy-installed-libraries-from-one-cluster-to-another.json b/scraped_kb_articles/copy-installed-libraries-from-one-cluster-to-another.json new file mode 100644 index 0000000000000000000000000000000000000000..3c7c2211349f258dd5bf7554948c6920fa101c04 --- /dev/null +++ b/scraped_kb_articles/copy-installed-libraries-from-one-cluster-to-another.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/copy-installed-libraries-from-one-cluster-to-another", + "title": "Unknown Article Title", + "content": "If you have a highly customized Databricks cluster, you may want to duplicate it and use it for other projects. When you clone a cluster, only the Apache Spark configuration and other cluster configuration information is copied. 
Installed libraries are not copied by default.\nTo copy the installed libraries, you can run a Python script after cloning the cluster.\nInstructions\nIdentify source and target\nThe source cluster is the cluster you want to copy from.\nThe target cluster is the cluster you want to copy to.\nYou can find the\n\nand the\n\nby selecting the cluster in the workspace, and then looking for the cluster ID in the URL.\nhttps:///#/setting/clusters/\nIn the following screenshot, the cluster ID is\n0801-112947-n650q4k\n.\nCreate a Databricks personal access token\nFollow the\nPersonal access tokens for users\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a personal access token.\nCreate a secret scope\nFollow the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a secret scope.\nStore your personal access token and your Databricks instance in the secret scope\nFollow the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to store the personal access token you created and your Databricks instance as new secrets within your secret scope.\nYour Databricks instance is the hostname for your workspace, for example, xxxxx.cloud.databricks.com.\nUse a Python script to clone the installed libraries\nYou can use this example Python script to copy the installed libraries from a source cluster to a target cluster.\nYou need to replace the following values in the script before running:\n\n- The name of your scope that holds the secrets.\n\n- The name of the secret that holds your Databricks instance.\n\n- The name of the secret that holds your personal access token.\n\n- The cluster ID of the cluster you want to copy FROM.\n\n- The cluster ID of the cluster you want to copy TO.\nCopy the example script into a notebook that is attached to a running cluster in your workspace.\n%python\r\n\r\nimport requests\r\nimport json\r\nimport time\r\nfrom pyspark.sql.types import (StructField, StringType, 
StructType, IntegerType)\r\n\r\nAPI_URL = dbutils.secrets.get(scope = \"\", key = \"\")    # https://xxxxx.cloud.databricks.com/\r\nTOKEN = dbutils.secrets.get(scope = \"\", key = \"\")    # Databricks PAT token\r\nsource_cluster_id = \"\"\r\ntarget_cluster_id = \"\"\r\nsource_cluster_api_url = API_URL + \"/api/2.0/libraries/cluster-status?cluster_id=\" + source_cluster_id\r\nresponse = requests.get(source_cluster_api_url, headers={'Authorization': \"Bearer \" + TOKEN})\r\nlibraries = []\r\nfor library_info in response.json()['library_statuses']:\r\n  lib_type = library_info['library']\r\n  status = library_info['status']\r\n  libraries.append(lib_type)\r\n    \r\nprint(\"libraries from source cluster (\"+source_cluster_id+\") : \"+str(libraries)+\"\\n\")\r\ntarget_cluster_api_url = API_URL + \"/api/2.0/libraries/install\"\r\n \r\ntarget_lib_install_payload = json.dumps({'cluster_id': target_cluster_id, 'libraries': libraries})\r\nprint(\"Installing libraries in target cluster (\"+target_cluster_id+\") with payload: \"+str(target_lib_install_payload)+\"\\n\")\r\nresponse = requests.post(target_cluster_api_url, headers={'Authorization': \"Bearer \" + TOKEN}, data = target_lib_install_payload)\r\nif response.status_code == 200:\r\n  print(\"Installation request is successful. Response code: \"+str(response.status_code))\r\nelse:\r\n  print(\"Installation failed. Response code: \"+str(response.status_code))\nTest target cluster\nAfter the script finishes running, start the target cluster and verify that the libraries have been copied over." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/copy-into-command-failing-on-partition-columns-with-string-data-types-that-start-with-an-integer.json b/scraped_kb_articles/copy-into-command-failing-on-partition-columns-with-string-data-types-that-start-with-an-integer.json new file mode 100644 index 0000000000000000000000000000000000000000..305c5fc068689d478516df6d07a6e184a7637bdd --- /dev/null +++ b/scraped_kb_articles/copy-into-command-failing-on-partition-columns-with-string-data-types-that-start-with-an-integer.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/copy-into-command-failing-on-partition-columns-with-string-data-types-that-start-with-an-integer", + "title": "Unknown Article Title", + "content": "Problem\nYou’re using a Databricks notebook to try to transfer data from a source to a sink location, particularly in the context of Delta tables. When you use the\nCOPY INTO\ncommand on a column designated for partitioning that has data type\nSTRING\nand begins with a numeric value, you receive an error. For example, a column “Date” has data type\nSTRING\nand has a value starting with a number, 2025-01-01.\nClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String\nCause\nThe file index logic infers the partition column types by default. A partition column with data formatted “yyyy-mm-dd” is inferred into DateType, and then casting StringType to DateType generates an integer. 
When the Parquet reader tries to get a UTF8String partition value and finds an integer, it fails with an error.\nSolution\nIn the same notebook, disable the partition column type inference while reading data using the following configuration.\nSET spark.sql.sources.partitionColumnTypeInference.enabled = false" +} \ No newline at end of file diff --git a/scraped_kb_articles/copy-into-not-loading-new-data-to-destination-table.json b/scraped_kb_articles/copy-into-not-loading-new-data-to-destination-table.json new file mode 100644 index 0000000000000000000000000000000000000000..b393963d32678b8d0e502fdab221335bc53ee583 --- /dev/null +++ b/scraped_kb_articles/copy-into-not-loading-new-data-to-destination-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/copy-into-not-loading-new-data-to-destination-table", + "title": "Unknown Article Title", + "content": "Problem\nWhen using the\nCOPY INTO\ncommand to load data into a Unity Catalog table, you notice new data added to the source file is not copying into the table.\nCause\nThe\nCOPY INTO\ncommand is designed to be idempotent, meaning that files in the source location that have already been loaded are skipped, even if the files have been modified since they were loaded. This is true even if you modify the table (for example, removing all rows) or modify the file.\nSolution\nInclude the\nforce\noption in the\nCOPY_OPTIONS\nparameters and set it to true. This disables idempotency and forces file loading regardless of whether the files were previously loaded.\nCOPY INTO ..\r\nFROM ''\r\nFILEFORMAT = \r\nFORMAT_OPTIONS (\r\n  'header' = 'true',\r\n  'inferSchema' = 'true'\r\n)\r\nCOPY_OPTIONS (\r\n  'mergeSchema' = 'false',\r\n  'force' = 'true'\r\n);\nFor more information, please refer to the\nCOPY INTO\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cors-policy-error-when-trying-to-run-databricks-api-from-a-browser-based-application.json b/scraped_kb_articles/cors-policy-error-when-trying-to-run-databricks-api-from-a-browser-based-application.json new file mode 100644 index 0000000000000000000000000000000000000000..bac87abad4ab82234477c0383fb35aedb8c99924 --- /dev/null +++ b/scraped_kb_articles/cors-policy-error-when-trying-to-run-databricks-api-from-a-browser-based-application.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/cors-policy-error-when-trying-to-run-databricks-api-from-a-browser-based-application", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to run Databricks APIs from a browser-based web application, you see a CORS (cross-origin resource sharing) policy error.\nError : Access to fetch at 'https://dbc-xxxxx-xxx.cloud.databricks.com/api/2.0/sql/statements/' from origin 'https:/' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled. 
Understand this error\r\nzone-evergreen.js:1068\r\nPOST https:/dbc-xxxxx-xxx.cloud.databricks.com/api/2.0/sql/statements/ net::ERR_FAILED\nCause\nCORS is a security mechanism web browsers implement to restrict web pages from making requests to a different domain (such as\ndatabricks.com\n) than the one serving the web page.\nDatabricks does not allow CORS on most HTTP endpoints for security reasons.\nSolution\nSet up a backend server to sit between your browser application and the Databricks control plane, to handle communication with the Databricks control plane.\nEnsure the backend server has the same domain as the browser application.\nImplement OAuth machine-to-machine (M2M) authentication for secure communication. Obtain an OAuth token from Databricks and use it to authenticate API requests.\nUse any server-side technology, such as Node.js, Python (Flask or Django), Go, or other backend frameworks you prefer.\nEnsure that all requests from the web application are proxied through the backend server. The backend server will handle the OAuth authentication.\nFor more information on OAuth M2M, review the\nAuthenticate access to Databricks with a service principal using OAuth (OAuth M2M)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cosmosdb-connector-lib-conf.json b/scraped_kb_articles/cosmosdb-connector-lib-conf.json new file mode 100644 index 0000000000000000000000000000000000000000..679e5c77c0274065615e1fe44f7b3bb14846cd77 --- /dev/null +++ b/scraped_kb_articles/cosmosdb-connector-lib-conf.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/cosmosdb-connector-lib-conf", + "title": "Unknown Article Title", + "content": "This article explains how to resolve an issue running applications that use the CosmosDB-Spark connector in the Databricks environment.\nProblem\nNormally if you add a Maven dependency to your Spark cluster, your app should be able to use the required connector libraries. But currently, if you simply specify the CosmosDB-Spark connector’s Maven coordinates as a dependency for the cluster, you will get the following exception:\njava.lang.NoClassDefFoundError: Could not initialize class com.microsoft.azure.cosmosdb.Document\nCause\nThis occurs because Spark 2.3 uses\njackson-databind-2.6.7.1\n, whereas the CosmosDB-Spark connector uses\njackson-databind-2.9.5\n. This creates a library conflict, and at the executor level you observe the following exception:\njava.lang.NoSuchFieldError: ALLOW_TRAILING_COMMA\r\nat com.microsoft.azure.cosmosdb.internal.Utils.(Utils.java:69)\nSolution\nTo avoid this problem:\nDirectly download the CosmosDB-Spark connector Uber JAR:\nazure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar\n.\nUpload the downloaded JAR to Databricks following the instructions in Upload a Jar, Python egg, or Python wheel (\nAWS\n|\nAzure\n).\nInstall the uploaded library as a Cluster-installed library (\nAWS\n|\nAzure\n).\nFor more information, see Azure Cosmos DB (\nAWS\n|\nAzure\n)." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/count-of-corrupt_records-returns-zero-in-serverless.json b/scraped_kb_articles/count-of-corrupt_records-returns-zero-in-serverless.json new file mode 100644 index 0000000000000000000000000000000000000000..40ec21e9299fa28138897fdc6109bf91b89562b1 --- /dev/null +++ b/scraped_kb_articles/count-of-corrupt_records-returns-zero-in-serverless.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/count-of-corrupt_records-returns-zero-in-serverless", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to perform a count on corrupt records after passing a custom schema while reading a file, you encounter an issue where the count always returns as zero even if there are valid records in the DataFrame.\nExample\ndf = spark.read.schema().option(\"columnNameOfCorruptRecord\",\"_corrupt_record\").csv()\r\ncorrupt_records = df.filter(\"_corrupt_record is not null\")\r\ncorrupt_records.count()\nWhere\ncount()\ngives a result of\n0\neven though there are records in the\ncorrupt_records\nDataFrame and multiple CSV files.\nCause\nWhen trying to filter records based on the corrupt record, column pruning optimization removes all other columns from the Apache Spark plan. 
Spark doesn’t read any data from the source, resulting in a zero record count.\nSolution\nCollect records in an array and get the count of the array instead.\nExample\ndf = spark.read.schema().option(\"columnNameOfCorruptRecord\",\"_corrupt_record\").csv()\r\ncorrupt_records = df.filter(\"_corrupt_record is not null\")\r\nlen(corrupt_records.collect())" +} \ No newline at end of file diff --git a/scraped_kb_articles/count-operation-on-a-dataframe-returning-zero-or-incorrect-number-of-records.json b/scraped_kb_articles/count-operation-on-a-dataframe-returning-zero-or-incorrect-number-of-records.json new file mode 100644 index 0000000000000000000000000000000000000000..d22c194d1a0fb1e026b09eee769ceb114c161c70 --- /dev/null +++ b/scraped_kb_articles/count-operation-on-a-dataframe-returning-zero-or-incorrect-number-of-records.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/count-operation-on-a-dataframe-returning-zero-or-incorrect-number-of-records", + "title": "Unknown Article Title", + "content": "Problem\nWhile performing\nCOUNT\noperations on a DataFrame or temporary view created from a Delta table in Apache Spark, you notice the\nCOUNT\noperation intermittently returns zero or an incorrect number of records, even when the data exists.\nCause\nYou have parallel\nDELETE\nor\nUPDATE\noperations interfering with the\nCOUNT\noperation performed on the temporary view or cached DataFrame, leading to temporary record loss or outdated statistics.\nContext\nWhen\nDELETE\ntasks run simultaneously with\nCOUNT\nqueries, they modify the underlying data, which can result in the\nCOUNT\noperation observing empty tables or outdated statistics. 
Empty tables and outdated statistics occur because Spark’s execution triggers a local scan and recomputes the DataFrame, invalidating cached states due to the detected changes in the source table.\nUsing the same compute cluster for both queries and data modifications further exacerbates inconsistent data counts, leading to data volatility and inconsistencies in the results during real-time modifications of the table.\nSolution\nSchedule\nDELETE\nor\nUPDATE\noperations and\nCOUNT\nqueries to run sequentially instead of in parallel. This prevents temporary inconsistencies in the table and ensures accurate results.\nSave the DataFrame to a checkpoint or temporary table in a physical location or file. This creates a stable snapshot of the data that is immune to concurrent operations and reduces the risk of data loss in case of failures.\nUse snapshot isolation to provide a consistent view of the data throughout the duration of your operations. For example, you can query a specific version of the table.\ndf = spark.sql(\"SELECT * FROM .. VERSION AS OF 1\")\r\ndf.createOrReplaceTempView(\"\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/create-df-from-json-string-python-dictionary.json b/scraped_kb_articles/create-df-from-json-string-python-dictionary.json new file mode 100644 index 0000000000000000000000000000000000000000..53c62fbd9949465426e12e1a5cb6a560bbc67d3b --- /dev/null +++ b/scraped_kb_articles/create-df-from-json-string-python-dictionary.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/create-df-from-json-string-python-dictionary", + "title": "Unknown Article Title", + "content": "This article explains how you can create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary.\nInfo\nA previous version of this article recommended using Scala for this use case. Databricks recommends using Python. 
All of the sample code in this article is written in Python.\nCreate a Spark DataFrame from a JSON string\n1. Add the JSON content from the variable to a list.\n%python\r\n\r\nfrom pyspark.sql import Row\r\nimport json\r\n\r\njson_content1 = \"\"\"{\"json_col1\": \"hello\", \"json_col2\": \"32\"}\"\"\"\r\njson_content2 = \"\"\"{\"json_col1\": \"hello\", \"json_col2\": \"world\"}\"\"\"\n2. Add the JSON content as a dictionary object to a Python list.\njson_data = []\r\njson_data.append(json.loads(json_content1))\r\njson_data.append(json.loads(json_content2))\n3. Parse the list of dictionaries to create a Spark DataFrame.\nrows = [Row(**json_dict) for json_dict in json_data]\r\ndf = spark.createDataFrame(rows)\r\ndisplay(df)\nCombined sample code\nThis sample code block combines the previous steps into a single example.\n%python\r\n\r\nfrom pyspark.sql import Row\r\nimport json\r\n\r\njson_content1 = \"\"\"{\"json_col1\": \"hello\", \"json_col2\": \"32\"}\"\"\"\r\njson_content2 = \"\"\"{\"json_col1\": \"hello\", \"json_col2\": \"world\"}\"\"\"\r\n\r\n\r\njson_data = []\r\njson_data.append(json.loads(json_content1))\r\njson_data.append(json.loads(json_content2))\r\nrows = [Row(**json_dict) for json_dict in json_data]\r\ndf = spark.createDataFrame(rows)\r\ndisplay(df)\nExtract a string column with JSON data from a DataFrame and parse it\n1. Create a sample DataFrame.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\nfrom pyspark.sql.types import *\r\n\r\ndata = [(\"1\", \"{'json_col1': 'hello', 'json_col2': 32}\", \"1.0\"),(\"1\", \"{'json_col1': 'hello', 'json_col2': 'world'}\", \"1.0\")]\r\n\r\nschema = StructType([\r\n  StructField(\"id\", StringType()),\r\n  StructField(\"value\", StringType()),\r\n  StructField(\"token\", StringType())\r\n])\r\n\r\ndf = spark.createDataFrame(data, schema)\n2. 
Use from_json() to parse the JSON column when reading the DataFrame.\njson_schema = StructType([\r\n    StructField(\"json_col1\", StringType(), True),\r\n    StructField(\"json_col2\", StringType(), True)\r\n])\r\ndf2 = df.withColumn('json', from_json(col('value'), json_schema)).select(\"*\", \"json.*\")\r\ndisplay(df2)\nCombined sample code\nThis sample code block combines the previous steps into a single example.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\nfrom pyspark.sql.types import *\r\n\r\ndata = [(\"1\", \"{'json_col1': 'hello', 'json_col2': 32}\", \"1.0\"),(\"1\", \"{'json_col1': 'hello', 'json_col2': 'world'}\", \"1.0\")]\r\n\r\nschema = StructType([\r\n  StructField(\"id\", StringType()),\r\n  StructField(\"value\", StringType()),\r\n  StructField(\"token\", StringType())\r\n])\r\n\r\ndf = spark.createDataFrame(data, schema)\r\n\r\njson_schema = StructType([\r\n    StructField(\"json_col1\", StringType(), True),\r\n    StructField(\"json_col2\", StringType(), True)\r\n])\r\n\r\ndf2 = df.withColumn('json', from_json(col('value'), json_schema)).select(\"*\", \"json.*\")\r\ndisplay(df2)\nCreate a Spark DataFrame from a Python dictionary\n1. 
Check the data type and confirm that it is dictionary type.\n%python\r\n\r\nfrom pyspark.sql import Row\r\nimport json\r\njsonDataDict = {\"job_id\":33100,\"run_id\":1048560,\"number_in_job\":1,\"state\":{\"life_cycle_state\":\"PENDING\",\"state_message\":\"Waiting for cluster\"},\"task\":{\"notebook_task\":{\"notebook_path\":\"/Users/user@databricks.com/path/test_notebook\"}},\"cluster_spec\":{\"new_cluster\":{\"spark_version\":\"4.3.x-scala2.11\",\"attributes\":{\"type\":\"fixed_node\",\"memory\":\"8g\"},\"enable_elastic_disk\":\"false\",\"num_workers\":1}},\"cluster_instance\":{\"cluster_id\":\"0000-000000-wares10\"},\"start_time\":1584689872601,\"setup_duration\":0,\"execution_duration\":0,\"cleanup_duration\":0,\"creator_user_name\":\"user@databricks.com\",\"run_name\":\"my test job\",\"run_page_url\":\"https://testurl.databricks.com#job/33100/run/1\",\"run_type\":\"SUBMIT_RUN\"}\r\n\r\ntype(jsonDataDict)\n2. Add the dictionary to a list and parse the list to create a Spark DataFrame.\nrows = [Row(**json_dict) for json_dict in [jsonDataDict]]\r\ndf = spark.createDataFrame(rows)\r\ndf.display()\nCombined sample code\nThis sample code block combines the previous steps into a single example.\n%python\r\n\r\nfrom pyspark.sql import Row\r\nimport json\r\njsonDataDict = {\"job_id\":33100,\"run_id\":1048560,\"number_in_job\":1,\"state\":{\"life_cycle_state\":\"PENDING\",\"state_message\":\"Waiting for cluster\"},\"task\":{\"notebook_task\":{\"notebook_path\":\"/Users/user@databricks.com/path/test_notebook\"}},\"cluster_spec\":{\"new_cluster\":{\"spark_version\":\"4.3.x-scala2.11\",\"attributes\":{\"type\":\"fixed_node\",\"memory\":\"8g\"},\"enable_elastic_disk\":\"false\",\"num_workers\":1}},\"cluster_instance\":{\"cluster_id\":\"0000-000000-wares10\"},\"start_time\":1584689872601,\"setup_duration\":0,\"execution_duration\":0,\"cleanup_duration\":0,\"creator_user_name\":\"user@databricks.com\",\"run_name\":\"my test 
job\",\"run_page_url\":\"https://testurl.databricks.com#job/33100/run/1\",\"run_type\":\"SUBMIT_RUN\"}\r\n\r\nrows = [Row(**json_dict) for json_dict in [jsonDataDict]]\r\ndf = spark.createDataFrame(rows)\r\ndf.display()" +} \ No newline at end of file diff --git a/scraped_kb_articles/create-or-replace-sql-error-in-a-delta-table.json b/scraped_kb_articles/create-or-replace-sql-error-in-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..f84f70df53e8672ffd4593e2770dc836ea01de8f --- /dev/null +++ b/scraped_kb_articles/create-or-replace-sql-error-in-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/create-or-replace-sql-error-in-a-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to run the\nCREATE or REPLACE\nstatement against a Delta table, you may encounter the following issue:\n[TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view .. because it already exists. Choose a different name, drop or replace the existing object, add the IF NOT EXISTS clause to tolerate pre-existing objects, or add the OR REFRESH clause to refresh the existing streaming table.\nCause\nWhen multiple simultaneous\nCREATE OR REPLACE\nqueries run, a table or view may be created before the previous process has finished, leading to conflicts and errors.\nSolution\nCorrect the job schedule to ensure that only one query is executed at a time for a specific table.\nMake sure the catalog name, schema name and table name values are specified correctly.\nIn general, avoid running multiple\nCREATE OR REPLACE\nqueries simultaneously. Running parallel queries in notebooks and DBSQL targeting the creation or modification of a table might conflict." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/create-table-ddl-for-metastore.json b/scraped_kb_articles/create-table-ddl-for-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..dfa77a7fe6e4f385bac80fa9df0a97726d60fa30 --- /dev/null +++ b/scraped_kb_articles/create-table-ddl-for-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/create-table-ddl-for-metastore", + "title": "Unknown Article Title", + "content": "Databricks supports using external metastores instead of the default Hive metastore.\nYou can export all table metadata from Hive to the external metastore.\nUse the Apache Spark\nCatalog\nAPI to list the tables in the databases contained in the metastore.\nUse the\nSHOW CREATE TABLE\nstatement to generate the DDLs and store them in a file.\nUse the file to import the table DDLs into the external metastore.\nThe following code accomplishes the first two steps.\n%python\r\n\r\ndbs = spark.catalog.listDatabases()\r\nfor db in dbs:\r\n  f = open(\"your_file_name_{}.ddl\".format(db.name), \"w\")\r\n  tables = spark.catalog.listTables(db.name)\r\n  for t in tables:\r\n    DDL = spark.sql(\"SHOW CREATE TABLE {}.{}\".format(db.name, t.name))\r\n    f.write(DDL.first()[0])\r\n    f.write(\"\\n\")\r\n  f.close()\nYou can use the resulting file to import the table DDLs into the external metastore." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/create-table-error-external-hive.json b/scraped_kb_articles/create-table-error-external-hive.json new file mode 100644 index 0000000000000000000000000000000000000000..733809bfd652d860c37c23fe67fb021562a289f9 --- /dev/null +++ b/scraped_kb_articles/create-table-error-external-hive.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/create-table-error-external-hive", + "title": "Unknown Article Title", + "content": "Problem\nYou are connecting to an external MySQL metastore and attempting to create a table when you get an error.\nAnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException:\r\nMetaException(message:An exception was thrown while adding/validating class(es) : (conn=21)\r\nColumn length too big for column 'PARAM_VALUE' (max = 16383); use BLOB or TEXT instead.\nCause\nThis is a known issue with MySQL 8.0 when the default charset is\nutf8mb4\n.\nYou can confirm this by running a query on the database with the error.\n%sql\r\n\r\nSELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = \"\"\nSolution\nYou need to update or recreate the database and set the charset to\nlatin1\n.\nOption 1\nManually run create statements in the Hive database with\nDEFAULT CHARSET=latin1\nat the end of each\nCREATE TABLE\nstatement.\n%sql\r\n\r\nCREATE TABLE `TABLE_PARAMS`\r\n(\r\n    `TBL_ID` BIGINT NOT NULL,\r\n    `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,\r\n    `PARAM_VALUE` VARCHAR(4000) BINARY NULL,\r\n    CONSTRAINT `TABLE_PARAMS_PK` PRIMARY KEY (`TBL_ID`,`PARAM_KEY`)\r\n) ENGINE=INNODB DEFAULT CHARSET=latin1;\nRestart the Hive metastore and repeat until all creation errors have been resolved.\nOption 2\nSet up the database and user accounts.\nCreate the database and run\nalter database hive character set latin1;\nbefore you launch the metastore.\nThis command sets the default\nCHARSET\nfor the database. 
It is applied when the metastore creates tables." +} \ No newline at end of file diff --git a/scraped_kb_articles/create-table-json-serde.json b/scraped_kb_articles/create-table-json-serde.json new file mode 100644 index 0000000000000000000000000000000000000000..9e922fa3d7d977219c9e718d0df173b44d88d444 --- /dev/null +++ b/scraped_kb_articles/create-table-json-serde.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/create-table-json-serde", + "title": "Unknown Article Title", + "content": "In this article, we cover how to create a table on JSON datasets using SerDe.\nDownload the JSON SerDe JAR\nOpen the\nhive-json-serde 1.3.8\ndownload page.\nClick\njson-serde-1.3.8-jar-with-dependencies.jar\nto download the file.\nInfo\nYou can review the\nHive-JSON-Serde\nGitHub repo for more information on the JAR, including source code.\nInstall the JSON SerDe JAR on your cluster\nSelect your cluster in the workspace.\nClick the\nLibraries\ntab.\nClick\nInstall new\n.\nIn the Library Source button list, select\nUpload\n.\nIn the Library Type button list, select\nJAR\n.\nClick\nDrop JAR here\n.\nSelect the\njson-serde-1.3.8-jar-with-dependencies.jar\nfile.\nClick\nInstall\n.\nConfigure SerDe properties in the create table statement\n%sql\r\n\r\nROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\r\nSTORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'\r\nOUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\r\nLOCATION ''\nFor example:\n%sql\r\n\r\ncreate table (timestamp_unix string, comments string, start_date string, end_date string)\r\npartitioned by (yyyy string, mm string, dd string)\r\nROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\r\nSTORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'\r\nOUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\r\nLOCATION ''\r\nThis example creates a table that is partitioned 
by the columns yyyy, mm, and dd.\nRun a repair table statement after the table is created\nFor example:\n%sql\r\n\r\nmsck repair table " +} \ No newline at end of file diff --git a/scraped_kb_articles/creating-a-pat-or-oauth-token-to-access-azure-databricks.json b/scraped_kb_articles/creating-a-pat-or-oauth-token-to-access-azure-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..047c8820b0ce794fd25c25ac3603957b6303bcac --- /dev/null +++ b/scraped_kb_articles/creating-a-pat-or-oauth-token-to-access-azure-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/creating-a-pat-or-oauth-token-to-access-azure-databricks", + "title": "Unknown Article Title", + "content": "Problem\nYou want to access Azure Databricks resources without logging in to the workspace, for example by sharing a token with an end user through an API or by integrating with third-party tools and applications.\nCause\nYou want to automate access or otherwise avoid extra user intervention.\nSolution\nTo authenticate with the service principal, you can use Personal Access Tokens (PATs) or an OAuth access token, which is a Microsoft Entra ID (formerly Azure Active Directory) access token.\nNote\nDatabricks recommends using OAuth access tokens instead of PATs for greater security and convenience. To use an OAuth token, refer to the\nAuthenticate access to Azure Databricks with a service principal using OAuth\ndocumentation.\nDatabricks continues to support PATs, but due to their greater security risk, you should audit your account’s current PAT usage.\nThe OAuth token, or Microsoft Entra ID access token, is valid for 60 minutes. If you need to set a defined lifetime, you can create a PAT for your service principal either through the Azure and Databricks CLIs or using a REST API with curl.\nCreate a PAT with the Azure and Databricks CLIs\nThese instructions assume the service principal is already a user at the workspace level.\n1. 
Gather the service principal credentials: the client ID (App ID) and secret.\n2. Log in to the Azure CLI with the service principal credentials.\naz login \\\r\n --service-principal \\\r\n --tenant \"$tenant_id\" \\\r\n --username \"$client_id\" \\\r\n --password \"$app_secret\" \\\r\n --allow-no-subscriptions\n3. Get a Microsoft Entra ID token with the following Azure CLI command. The resource ID\n2ff814a6-3304-4ab8-85cb-cd0e6f879c1d\nrefers to the Azure Databricks service.\nexport DATABRICKS_AAD_TOKEN=$(az account get-access-token \\\r\n--resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \\\r\n--query \"accessToken\" \\\r\n--output tsv)\n4. Configure the Databricks CLI with the URL of the workspace and the Microsoft Entra ID token created in the previous step.\ndatabricks configure \\\r\n --aad-token \\\r\n --host https://adb-6696713541144394.14.azuredatabricks.net/\n5. Set the token permission for the service principal by following the\nSet token permissions\nAPI instructions. This step is required.\n6. Create a PAT. For reference, one day is 86400 seconds. If the\n--lifetime-seconds\noption is not specified, the access token will never expire (not recommended).\ndatabricks tokens \\\r\ncreate \\\r\n--lifetime-seconds  \\\r\n--comment $token_name\nThe service principal can now manage its own PATs. The\n.token_value\nfield in the JSON output is the only place the PAT value is displayed, so save it somewhere safe at this point.\nRevoke a PAT with the CLI\nTo remove the token before it expires, use the following command. You can get the\ntoken-id\nusing the command\ndatabricks tokens list\n.\ndatabricks tokens \\\r\nrevoke \\\r\n--token-id $token_id\nCreate a PAT using a REST API with curl\nThese instructions assume the service principal is already a user at the workspace level.\n1. Log in to Azure Active Directory (AAD) with the service principal credentials (client ID (App ID) and secret).\n2. Obtain a Microsoft Entra ID token. 
The resource ID\n2ff814a6-3304-4ab8-85cb-cd0e6f879c1d\nrefers to the Azure Databricks service. To get your Databricks hostname, refer to the\nGet identifiers for workspaces objects\ndocumentation.\nexport DATABRICKS_HOST=\"\"\r\n\r\nexport AAD_TOKEN=$(curl \\\r\n-X POST \\\r\n-H 'Content-Type: application/x-www-form-urlencoded' \\\r\n\"https://login.microsoftonline.com/${}/oauth2/v2.0/token\" \\\r\n-d \"client_id=${}\" \\\r\n-d 'grant_type=client_credentials' \\\r\n-d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \\\r\n-d \"client_secret=${}\" \\\r\n| jq -r '.access_token')\n3. Set the token permission for the service principal by following the\nSet token permissions\nAPI instructions. This step is required.\n4. Create a PAT with the previously obtained Microsoft Entra ID token.\ncurl -v -X POST \\\r\n-H \"Content-Type: application/json\" \\\r\n-H \"Authorization: Bearer ${}\" \\\r\n\"https://${DATABRICKS_HOST}/api/2.0/token/create\" \\\r\n-d '{ \"lifetime_seconds\": 7776000, \"comment\": \"my-pat-123\" }'\nThe service principal can now manage its own PATs.\nRevoke a PAT using curl\nFirst, get the\ntoken_id\nthat needs to be revoked using the Databricks REST API. The following command returns a list of tokens associated with your Azure Databricks account. Find your\ntoken_id\nin that list. 
Make sure the PAT you supply has the necessary permissions to list tokens.\ncurl -X GET https:///api/2.0/token/list \\\r\n  -H \"Authorization: Bearer \"\nThen, run the following curl command to revoke that token.\ncurl -v -X POST \\\r\n-H \"Content-Type: application/json\" \\\r\n-H \"Authorization: Bearer ${}\" \\\r\n\"https://${DATABRICKS_HOST}/api/2.0/token/delete\" \\\r\n-d \"{ \\\"token_id\\\": \\\"$TOKEN_ID\\\" }\"" +} \ No newline at end of file diff --git a/scraped_kb_articles/creating-a-temporary-view-inside-a-function-returns-an-invalid_temp_obj_reference-error.json b/scraped_kb_articles/creating-a-temporary-view-inside-a-function-returns-an-invalid_temp_obj_reference-error.json new file mode 100644 index 0000000000000000000000000000000000000000..cc5971001d3e63ae7c0e2c5221aaac49d593df06 --- /dev/null +++ b/scraped_kb_articles/creating-a-temporary-view-inside-a-function-returns-an-invalid_temp_obj_reference-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/creating-a-temporary-view-inside-a-function-returns-an-invalid_temp_obj_reference-error", + "title": "Unknown Article Title", + "content": "Problem\nYou try to create a temporary view within a function using the following code.\nCREATE TEMP VIEW AS\r\nSELECT \r\nFROM \r\nWHERE ;\r\n\r\nCREATE FUNCTION ()\r\nRETURNS TABLE()\r\nRETURN WITH cte AS (\r\n    SELECT \r\n    FROM \r\n)\r\nSELECT * FROM cte;\nWhen you run the code, you receive an error.\n[INVALID_TEMP_OBJ_REFERENCE] Cannot create the persistent object ``.``.`` of the type FUNCTION because it references to the temporary object `` of the type VIEW. SQLSTATE: 42K0F\nCause\nA persistent object, such as a function, cannot reference a temporary object, such as a temporary view. 
Persistent objects have a longer lifespan and are available across different sessions, while temporary objects are session-specific and have a limited lifespan.\nSolution\nMake the function temporary so both the function and the temporary view have the same lifespan and session scope.\nFirst, create a temporary view to filter or transform data from the persistent table.\nCREATE TEMP VIEW AS\r\nSELECT \r\nFROM \r\nWHERE ;\nNext, create a temporary function that uses the temporary view. The function then takes a parameter and returns a table.\nCREATE TEMP FUNCTION ()\r\nRETURNS TABLE()\r\nRETURN WITH cte AS (\r\n    SELECT \r\n    FROM \r\n)\r\nSELECT * FROM cte;\nLast, execute the temporary function.\nSELECT * FROM ();" +} \ No newline at end of file diff --git a/scraped_kb_articles/creating-an-azure-key-vault-backed-secret-scope-with-the-databricks-cli-fails-with-a-useraadtoken-error.json b/scraped_kb_articles/creating-an-azure-key-vault-backed-secret-scope-with-the-databricks-cli-fails-with-a-useraadtoken-error.json new file mode 100644 index 0000000000000000000000000000000000000000..2ab01b481802d9becd641c2582ade23a5d5f1236 --- /dev/null +++ b/scraped_kb_articles/creating-an-azure-key-vault-backed-secret-scope-with-the-databricks-cli-fails-with-a-useraadtoken-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/creating-an-azure-key-vault-backed-secret-scope-with-the-databricks-cli-fails-with-a-useraadtoken-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to create an Azure Key Vault-backed secret scope using the Databricks CLI, but the operation fails with an AAD token definition error message.\n\"Scope with Azure KeyVault must have userAADToken defined!\"\nCause\nThis error occurs when the authentication is not properly configured in the\n.databrickscfg\nfile for the service principal. 
You did not define the necessary\nuserAADToken\n.\nSolution\nFollow the steps below to create an Azure Key Vault-backed secret scope using the Databricks CLI.\nNote\nCreating an Azure Key Vault-backed secret scope requires the Contributor or Owner role on the Azure key vault instance, even if the Azure Databricks service has previously been granted access to the key vault. Ensure the service principal has the necessary roles assigned. For more information, review the\nAzure Key Vault-backed secret scope requirements\ndocumentation.\n1. Create or update the\n.databrickscfg\nfile to include a profile for the service principal. Use the following format:\n[DEFAULT] \r\n\thost =  \r\n\tazure_workspace_resource_id = \r\n\ttenant_id =  \r\n\tclient_id =  \r\n\tclient_secret = \nWhere:\n\n– The URL of your Databricks workspace.\n\n– The Azure resource ID of your Databricks workspace, available in the Azure portal under Resource JSON.\n\n– The Directory (tenant) ID associated with the Azure Active Directory (Microsoft Entra ID) where the service principal is registered.\n\n– The Application (client) ID of your Azure service principal.\n\n– The client secret value generated when you created client credentials.\nFor more information, review the\nMicrosoft Entra ID service principal authentication\ndocumentation.\n2. Authenticate in the Databricks CLI using the service principal.\ndatabricks auth env --profile DEFAULT\n3. Create a JSON configuration file.\n{\r\n  \"scope\": \"\",\r\n  \"initial_manage_principal\": \"users\",\r\n  \"scope_backend_type\": \"AZURE_KEYVAULT\",\r\n  \"backend_azure_keyvault\": {\r\n  \"resource_id\": \"\",\r\n  \"dns_name\": \"\"\r\n        }\r\n}\nWhere:\n\n– A name for the secret scope you are creating.\n\n– Available in the Overview section of your Key Vault >JSON View. Copy the Resource ID and paste in the above code.\n\n– Also available in the Overview section of your Key Vault > JSON View > Properties. 
Copy the vaultUri and paste in the above code.\n4. Create the secret scope with the Databricks CLI.\ndatabricks secrets create-scope --json @path_to_json_file.json\n5. List all secret scopes to verify the new secret scope was successfully created.\ndatabricks secrets list-scopes" +} \ No newline at end of file diff --git a/scraped_kb_articles/creation-failure-error-when-trying-to-create-a-vector-search-index.json b/scraped_kb_articles/creation-failure-error-when-trying-to-create-a-vector-search-index.json new file mode 100644 index 0000000000000000000000000000000000000000..6ddd84ed3faaef394c23d8369a5960a90bd2d9ce --- /dev/null +++ b/scraped_kb_articles/creation-failure-error-when-trying-to-create-a-vector-search-index.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/creation-failure-error-when-trying-to-create-a-vector-search-index", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re trying to create a Vector Search index. Whether you use the UI or the API, you receive an error.\nUsing the UI\nWhile trying to create a Vector Search index using the UI, you receive the following error message.\nIndex creation failed: Failed to call Model Serving endpoint \nUsing the API\nWhile trying to create a Vector Search index using the Databricks API, you receive the following error message in response to your call.\n\"error_code\":\"INVALID_PARAMETER_VALUE\",\"message\":\"Failed to call Model Serving endpoint: .\",\"details\":[{\"@type\":\"type.googleapis.com/google.rpc.RequestInfo\",\"request_id\":\"\",\"serving_data\":\"\"}]}\nCause\nYou are not using a text embedding model. When creating a Vector Search model, only text embedding models are permitted.\nSolution\nUse a text embedding model such as GTE Large (En) or BGE Large (En).\nFor details, refer to the “GTE Large (En)” and “BGE Large (En)” sections of the\nSupported models for Databricks Foundation Models APIs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cross-cloud-delta-sharing-query-results-in-403-response.json b/scraped_kb_articles/cross-cloud-delta-sharing-query-results-in-403-response.json new file mode 100644 index 0000000000000000000000000000000000000000..481501bb3c75e57e70824446c8ef305c590e6176 --- /dev/null +++ b/scraped_kb_articles/cross-cloud-delta-sharing-query-results-in-403-response.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/cross-cloud-delta-sharing-query-results-in-403-response", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an Azure Databricks workspace with a private network storage account and you are setting up Databricks-to-Databricks Delta Sharing with a recipient on another cloud platform to do cross-cloud sharing. The recipient tries to query the shared tables, but gets a file read exception error. The 403 error message details indicate an authorization failure.\nExample error message\nFileReadException: Error while reading file delta-sharing:/XXXXXXXXXXXXXXXXXXcc.uc-deltasharing%253A%252F%252Fshare_test.sample_schema.sample_table%2523share_test.sample_schema.sample_table_XXXXXXXXXXXXXX/XXXXXXXXXXXXXX/XXXXXX. org.apache.spark.SparkIOException Caused by: SparkIOException: [HDFS_HTTP_ERROR.UNCATEGORIZED] When attempting to read from HDFS, HTTP request failed. HTTP request failed with status: HTTP/1.1 403 This request is not authorized to perform this operation. {\"error\":{\"code\":\"AuthorizationFailure\",\"message\":\"This request is not authorized to perform this operation.\\nRequestId:XXXXXX\\nTime:2024-09-23T13:17:14.8410995Z\"}}, while accessing URI of shared table file SQLSTATE: KD00F Caused by: UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 403 This request is not authorized to perform this operation. 
{\"error\":{\"code\":\"AuthorizationFailure\",\"message\":\"This request is not authorized to perform this operation.\\nRequestId:XXXXXX\\nTime:2024-09-23T13:17:14.8410995Z\"}}, while accessing URI of shared table file\nCause\nThe 403 error and\nAuthorizationFailure\nmessage indicate the recipient’s queries are being blocked by your storage account firewall settings. This typically occurs when the recipient’s workspace egress IP has not been added to the allowlist on your Azure storage account's firewall.\nSolution\nClassic compute and pro warehouse recipients\nAdd the egress IP of the recipient’s workspace to your Azure storage account firewall allowlist.\nThe fixed egress IP of the recipient workspace can be found by inspecting the NAT gateway or the static public IP address attached to the recipient of the Azure Databricks workspace VNET. You can also allow the recipient VNET of the Azure Databricks workspace to directly connect to the storage account if you are not routing traffic through the Internet and are only using Azure service endpoints.\nLog in to the Azure portal.\nNavigate to your storage account where the shared data is residing.\nClick\nNetworking\nunder\nSecurity + networking\n.\nClick\nFirewalls and virtual networks\n.\nUnder\nFirewall\n, add the egress IP to allowlist it on the storage account.\nSave the changes.\nServerless recipients\nIf the recipient workspace uses serverless compute to query the shared data, its egress traffic originates from the serverless compute plane. To manage this traffic effectively, Databricks recommends creating and attaching a Network Connectivity Configuration (NCC) to the recipient workspace.\nOnce the NCC is attached, the egress serverless compute traffic from the recipient’s workspace will use a set of stable IP addresses. 
These IPs can then be allowlisted on the data provider’s Azure Storage account to ensure secure access to the share.\nFor detailed guidance on obtaining and configuring a list of stable IPs to allowlist, refer to the\nConfigure a firewall for serverless compute access\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/cuda-out-of-memory-error-message-in-gpu-clusters.json b/scraped_kb_articles/cuda-out-of-memory-error-message-in-gpu-clusters.json new file mode 100644 index 0000000000000000000000000000000000000000..49116536d9cc76a7c8c9099e2d558696706ed1b9 --- /dev/null +++ b/scraped_kb_articles/cuda-out-of-memory-error-message-in-gpu-clusters.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/cuda-out-of-memory-error-message-in-gpu-clusters", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen performing model training or fine-tuning a base model using a GPU compute cluster, you encounter the following error (with varying GiB and MiB values) during these processes:\ntorch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 21.99 GiB total capacity; 20.84 GiB already allocated; 19.00 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF\r\nThis error is typically raised by workloads that utilize PyTorch or other libraries that have PyTorch implementations, such as the Transformers library.\nCause\nThe CUDA memory is running out while trying to allocate additional memory for the model. The error arises when there is not enough free memory available.\nInfo\nGPU memory is separate from the memory used by the worker and driver nodes of the cluster. 
GPU memory is specific to the GPU device being used for computations.\nYou can check GPU utilization by navigating to the\nMetrics\ntab of the cluster you use to run a notebook. From there, you can filter the results by selecting\nGPU\nfrom the dropdown button in the top-right corner of the page.\nSolution\nSelect a suitable GPU device for your intended task, whether it's model training, fine-tuning, or inference. After determining which GPU device is best suited for your workload, navigate to\nCompute\n, select an existing cluster or create a new one, then select a Driver/Worker node type that utilizes the chosen GPU device. Once you've made this selection, you can resume working with your model.\nInfo\nEach cloud provider decides which instance types are available in each region. Review the cloud provider documentation to\ndetermine if a specific GPU is available in the region\n(\nAWS\n,\nAzure\n,\nGCP\n) you are using.\nModel training\nResearch the GPU devices available in compute instances for your cloud provider. For example, to address the problem stated in the error message, if your current cluster instance contains T4 GPU devices, consider switching to A10 or V100 devices, which offer larger memory capacities. Then, rerun your process.\nFine-tuning or inference\nCheck the model's repository on GitHub or its page on Hugging Face to see if specific GPU devices are recommended for specific tasks with that model. For example,\nDatabricks' Dolly LLM GitHub repository\nspecifies particular GPU instances to get started with response generation and training." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/cuda-outofmemoryerror-tried-to-allocate-n-mib-while-performing-model-training-on-the-gpu-compute.json b/scraped_kb_articles/cuda-outofmemoryerror-tried-to-allocate-n-mib-while-performing-model-training-on-the-gpu-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..deb233e59c2ce8af2b5ef026ade1970feb57b2c8 --- /dev/null +++ b/scraped_kb_articles/cuda-outofmemoryerror-tried-to-allocate-n-mib-while-performing-model-training-on-the-gpu-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/cuda-outofmemoryerror-tried-to-allocate-n-mib-while-performing-model-training-on-the-gpu-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile performing model training on a GPU-enabled compute, you receive an error message similar to the following example.\nwarnings.warn(\r\nOutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU\r\nFile , line 88\r\n85 return outputs[0][\"generated_text\"]\r\n87 # Instantiate the Hugging Face–based LLMs\r\n---> 88 response_llm = MistralLLM()\nCause\nGPU memory utilization is high.\nBy default, the model loads in 32-bit precision. Loading models with higher parameters (for example, greater than seven billion) requires higher-capacity GPU memory than the default.\nSolution\nUpdate to a higher-capacity GPU node.\nAlternatively, if you do not want to use a higher-capacity GPU node, modify your model’s\nfrom_pretrained()\ncall to reduce the precision from 32 bit to 16 bit and automatically distribute model layers across available devices (GPU or CPU) without manual checks.\nMake sure to comment out the\ndevice=0 if torch.cuda.is_available() else -1\nline of code, which is a manual check for GPU availability. 
It conflicts with\ndevice_map=\"auto\"\n.\nfrom transformers import AutoModelForCausalLM\r\nimport torch\r\n\r\n# Comment out this block:\r\n#device=0 if torch.cuda.is_available() else -1  # manual check for GPU availability\r\n\r\nmodel = AutoModelForCausalLM.from_pretrained(\r\n    \"\",\r\n    torch_dtype=torch.float16,  # force FP16, reducing precision from 32-bit to 16-bit\r\n    device_map=\"auto\"  # automatically place layers on GPU/CPU\r\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/custom-dns-routing.json b/scraped_kb_articles/custom-dns-routing.json new file mode 100644 index 0000000000000000000000000000000000000000..af8786c25c3333f8f7d54c08c8bb9bcf621fdd03 --- /dev/null +++ b/scraped_kb_articles/custom-dns-routing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/custom-dns-routing", + "title": "Unknown Article Title", + "content": "dnsmasq\nis a tool for installing and configuring DNS routing rules for cluster nodes. 
You can use it to set up routing between your Databricks environment and your on-premises network.\nWarning\nIf you use your own DNS server and it goes down, you will experience an outage and will not be able to create clusters.\nUse the following\ncluster-scoped init script\nto configure dnsmasq for a cluster node.\nUse\nnetcat (nc)\nto test connectivity from the notebook environment to your on-premises network.\nnc -vz 53\nCreate the base directory you want to store the init script in if it does not already exist.\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the script.\nAWS Scala example\ndbutils.fs.put(\"/databricks//dns-masq.sh\",\"\"\"\r\n#!/bin/bash\r\n########################################\r\n## Configure on-prem DNS access.\r\n########################################\r\n\r\nsudo apt-get update -y\r\nsudo apt-get install dnsmasq -y --force-yes\r\n\r\n## Add dns entries for internal your-company.net name servers\r\necho server=/databricks.net/ | sudo tee --append /etc/dnsmasq.conf\r\n\r\n## Find the default DNS settings for the EC2 instance and use them as the default DNS route\r\n\r\nec2_dns=$(grep \"nameserver\" /etc/resolv.conf | cut -d' ' -f 2)\r\necho \"Old dns in resolv.conf $ec2_dns\"\r\n\r\necho \"server=$ec2_dns\" | sudo tee --append /etc/dnsmasq.conf\r\n\r\n## configure resolv.conf to point to dnsmasq service instead of static resolv.conf file\r\nmv /etc/resolv.conf /etc/resolv.conf.orig\r\necho nameserver 127.0.0.1 | sudo tee --append /etc/resolv.conf\r\nsudo systemctl disable --now systemd-resolved\r\nsudo systemctl enable --now dnsmasq\r\n\"\"\", true)\nAzure Scala example\ndbutils.fs.put(\"/databricks//dns-masq.sh\",\"\"\"\r\n#!/bin/bash\r\nsudo apt-get update -y\r\nsudo apt-get install dnsmasq -y --force-yes\r\n\r\n## Add dns entries for internal nameservers\r\necho server=/databricks.net/ | sudo tee --append /etc/dnsmasq.conf\r\n   \r\n## Find the default DNS settings for the instance and use them as the default DNS 
route\r\nazvm_dns=$(grep \"nameserver\" /etc/resolv.conf | cut -d' ' -f 2)\r\necho \"Old dns in resolv.conf $azvm_dns\"\r\necho \"server=$azvm_dns\" | sudo tee --append /etc/dnsmasq.conf\r\n    \r\n## configure resolv.conf to point to dnsmasq service instead of static resolv.conf file\r\nmv /etc/resolv.conf /etc/resolv.conf.orig\r\necho nameserver 127.0.0.1 | sudo tee --append /etc/resolv.conf\r\nsudo systemctl disable --now systemd-resolved\r\nsudo systemctl enable --now dnsmasq\r\n\"\"\", true)\nCheck that the script exists.\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//dns-masq.sh\"))\nConfigure the init script that you just created as a\ncluster-scoped init script\n. You will need the full path to the location of the script (\ndbfs:/databricks//dns-masq.sh\n).\nLaunch a zero-node cluster to confirm that you can create clusters." +} \ No newline at end of file diff --git a/scraped_kb_articles/custom-docker-requires-root.json b/scraped_kb_articles/custom-docker-requires-root.json new file mode 100644 index 0000000000000000000000000000000000000000..a5e4c271420598c4fbe3fe901b2c6abb3d595a86 --- /dev/null +++ b/scraped_kb_articles/custom-docker-requires-root.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/custom-docker-requires-root", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to launch a Databricks cluster with a custom Docker container, but cluster creation fails with an error.\n{\r\n\"reason\": {\r\n\"code\": \"CONTAINER_LAUNCH_FAILURE\",\r\n\"type\": \"SERVICE_FAULT\",\r\n\"parameters\": {\r\n\"instance_id\": \"i-xxxxxxx\",\r\n\"databricks_error_message\": \"Failed to launch spark container on instance i-xxxx. Exception: Could not add container for xxxx with address xxxx. 
Could not mkdir in container\"\r\n              }\r\n          }\r\n}\nCause\nDatabricks clusters require a root user and sudo.\nCustom container images that are configured to start as a non-root user are not supported.\nFor more information, review the\ncustom container\ndocumentation.\nSolution\nYou must configure your Docker container to start as the root user.\nExample\nThis container configuration starts as the standard user ubuntu. It fails to launch.\nFROM databricksruntime/standard:8.x\r\nRUN apt-get update -y && apt-get install -y git && \\\r\nln -s /databricks/conda/envs/dcs-minimal/bin/pip /usr/local/bin/pip && \\\r\nln -s /databricks/conda/envs/dcs-minimal/bin/python /usr/local/bin/python\r\nCOPY . /app\r\nWORKDIR /app\r\nRUN pip install -r requirements.txt .\r\nRUN chown -R ubuntu /app\r\nUSER ubuntu\nThis container configuration starts as the root user. It launches successfully.\nFROM databricksruntime/standard:8.x\r\nRUN apt-get update -y && apt-get install -y git && \\\r\nln -s /databricks/conda/envs/dcs-minimal/bin/pip /usr/local/bin/pip && \\\r\nln -s /databricks/conda/envs/dcs-minimal/bin/python /usr/local/bin/python\r\nCOPY . /app\r\nWORKDIR /app\r\nRUN pip install -r requirements.txt ." +} \ No newline at end of file diff --git a/scraped_kb_articles/custom-garbage-collection-prevents-launch-dbr10.json b/scraped_kb_articles/custom-garbage-collection-prevents-launch-dbr10.json new file mode 100644 index 0000000000000000000000000000000000000000..1a44f24e67129a73f373f2b150e199fc2cb7570b --- /dev/null +++ b/scraped_kb_articles/custom-garbage-collection-prevents-launch-dbr10.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/custom-garbage-collection-prevents-launch-dbr10", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to use a custom Apache Spark garbage collection algorithm (other than the default, parallel garbage collection) on clusters running Databricks Runtime 10.0 and above. 
When you try to start the cluster, it fails. If the configuration is set on an executor, the executor is immediately terminated.\nFor example, if you set either of the following custom garbage collection algorithms in your\nSpark config\n, the cluster creation fails.\nSpark driver\nspark.driver.extraJavaOptions  -XX:+UseG1GC\nSpark executor\nspark.executor.extraJavaOptions -XX:+UseG1GC\nCause\nA new Java virtual machine (JVM) flag was introduced to set the garbage collection algorithm to parallel garbage collection. If you do not change the default, the new flag has no impact.\nIf you change the garbage collection algorithm by setting\nspark.executor.extraJavaOptions\nor\nspark.driver.extraJavaOptions\nin your\nSpark config\n, the value conflicts with the new flag. As a result, the JVM crashes and prevents the cluster from starting.\nSolution\nTo work around this issue, you must explicitly remove the parallel garbage collection flag in your\nSpark config\n. This must be done at the cluster level.\nspark.driver.extraJavaOptions -XX:-UseParallelGC -XX:+UseG1GC\r\nspark.executor.extraJavaOptions -XX:-UseParallelGC -XX:+UseG1GC" +} \ No newline at end of file diff --git a/scraped_kb_articles/cython.json b/scraped_kb_articles/cython.json new file mode 100644 index 0000000000000000000000000000000000000000..9cf0d1480df77336da6c1519d7607f27eeecee4c --- /dev/null +++ b/scraped_kb_articles/cython.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/cython", + "title": "Unknown Article Title", + "content": "This document explains how to run Spark code with compiled Cython code. 
The steps are as follows:\nCreate an example Cython module on DBFS (\nAWS\n|\nAzure\n).\nAdd the file to the Spark session.\nCreate a wrapper method to load the module on the executors.\nRun the mapper on a sample dataset.\nGenerate a larger dataset and compare the performance with the native Python example.\nInfo\nBy default, paths use\ndbfs:/\nif no protocol is referenced.\n%python\r\n\r\n# Write an example cython module to /example/cython/fib.pyx in DBFS.\r\ndbutils.fs.put(\"/example/cython/fib.pyx\", \"\"\"\r\ndef fib_mapper_cython(n):\r\n    '''\r\n    Return the first fibonacci number > n.\r\n    '''\r\n    cdef int a = 0\r\n    cdef int b = 1\r\n    cdef int j = int(n)\r\n    while b RETAIN 0 HOURS\nOR\n%sql VACUUM delta.`` RETAIN 0 HOURS\nWhen\nVACUUM\nis configured to retain 0 hours, it can delete any file that is not part of the version that is being vacuumed. This includes committed files, uncommitted files, and temporary files for concurrent transactions.\nConsider the following example timeline:\nVACUUM\nstarts running at 01:17 UTC on version 100.\nA data file named part-.snappy.parquet is added to version 101 at 01:18 UTC.\nVersion 101 is committed at 01:19 UTC.\nVACUUM\nis still running and deletes the data file part-.snappy.parquet added in version 101 at 01:20 UTC.\nVACUUM\ncompletes at 01:22 UTC.\nIn this example,\nVACUUM\nexecuted on version 100 and deleted everything that was added to version 101.\nSolution\nDatabricks recommends that you set a\nVACUUM\nretention interval to at least 7 days because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table.\nDo not set\nspark.databricks.delta.retentionDurationCheck.enabled\nto false in your Spark config.\nIf you do set\nspark.databricks.delta.retentionDurationCheck.enabled\nto false in your Spark config, you must choose an interval that is longer than the longest-running concurrent transaction and the longest period that any stream can lag
behind the most recent update to the table.\nReview the Databricks VACUUM documentation (\nAWS\n|\nAzure\n|\nGCP\n) for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/data-too-long-for-column.json b/scraped_kb_articles/data-too-long-for-column.json new file mode 100644 index 0000000000000000000000000000000000000000..04e3e337887d4aaa7e8bdfa593de458e66225c71 --- /dev/null +++ b/scraped_kb_articles/data-too-long-for-column.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/data-too-long-for-column", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to insert a struct into a table, but you get a\njava.sql.SQLException: Data too long for column\nerror.\nCaused by: java.sql.SQLException: Data too long for column 'TYPE_NAME' at row 1\r\nQuery is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) , parameters [103182,,'address','struct,street_address2:struct,street_address3:struct,street_address4:struct,street_address5:struct,street_address6:struct,street_address7:struct,street_address8:struct\nis the\nServer Hostname\nfor your cluster or SQL warehouse.\n\nis the\nHTTP Path\nvalue for your cluster or SQL warehouse.\n\nis your Databricks personal access token.\nInfo\nYou can find the hostname and path values for a cluster in the\nJDBC/ODBC\ntab in the\nAdvanced options\non a cluster’s configuration page. 
You can find the same values for a SQL warehouse in the\nConnection details\ntab on the warehouse’s configuration page.\n%python\r\n\r\nfrom databricks import sql\r\n\r\nconnection = sql.connect(\r\n    server_hostname = \"\",\r\n    http_path       = \"\",\r\n    access_token    = \"\"\r\n)\r\n\r\ncursor = connection.cursor()\r\ncursor.execute(\"SELECT * FROM my_table LIMIT 10\")\r\nresults = cursor.fetchall()\r\n\r\nfor row in results:\r\n    print(row)\r\n\r\ncursor.close()\r\nconnection.close()" +} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-based-queries-using-snowflake-federated-tables-running-slowly.json b/scraped_kb_articles/databricks-based-queries-using-snowflake-federated-tables-running-slowly.json new file mode 100644 index 0000000000000000000000000000000000000000..3a808696747a0951fed14822e1697fd4dc4698be --- /dev/null +++ b/scraped_kb_articles/databricks-based-queries-using-snowflake-federated-tables-running-slowly.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/databricks-based-queries-using-snowflake-federated-tables-running-slowly", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nComplex SQL queries using Snowflake federated tables are taking more time than you expect to run within Databricks.\nCause\nCertain query components, such as joins or aggregations, are performed in Databricks instead of being pushed down to Snowflake for processing. This results in increased data transfer over the network, which can reduce efficiency.\nContext\nBy default, the Apache Spark configuration flag\nspark.databricks.optimizer.aggregatePushdown.enabled\nis set to\ntrue\n. When enabled, Spark tries to push down partial aggregations below join nodes. 
However, the Snowflake connector’s pushdown strategy does not always support partial aggregations.\nAs a result, if the node below the join cannot be pushed down, the join itself is also not pushed down, and both are executed in Databricks instead of Snowflake.\nThis leads to multiple queries being sent to Snowflake, with the join and aggregation performed on the Databricks side, significantly slowing down query performance.\nSolution\nSet\nspark.databricks.optimizer.aggregatePushdown.enabled\nto\nfalse\nin your cluster settings. This prevents Spark from generating partial aggregate nodes, allowing both join and aggregation operations to be pushed down entirely to Snowflake. As a result, Snowflake can process the query more efficiently and return only the final aggregated results to Databricks, leading to faster query execution.\n1. In the Databricks UI, navigate to the\nCompute\nmenu option in the vertical menu on the left.\n2. Select the cluster you are using.\n3. In the\nCluster Configuration\ntab, click the\nEdit\nbutton in the top right.\n4. Scroll down to the\nAdvanced Options\nsection and click to expand.\n5. Enter the below configuration in the\nSpark > Spark Config\nfield.\nspark.databricks.optimizer.aggregatePushdown.enabled false\n6. Save the changes and restart the cluster for the new configuration to take effect." 
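The UI steps above amount to adding a single key to the cluster's Spark config. As a minimal sketch, if you manage cluster settings programmatically (for example through the `spark_conf` field of a Clusters API edit payload), the same change is a config merge — the helper name here is hypothetical, and the cluster still needs a restart afterwards:

```python
# Hypothetical helper: merge the pushdown flag into an existing
# cluster spark_conf mapping (e.g. for a Clusters API edit payload).
FLAG = "spark.databricks.optimizer.aggregatePushdown.enabled"

def with_aggregate_pushdown_disabled(spark_conf):
    """Return a copy of spark_conf with partial-aggregate pushdown disabled."""
    conf = dict(spark_conf)
    conf[FLAG] = "false"
    return conf

# Existing settings are preserved; only the flag is added/overwritten.
updated = with_aggregate_pushdown_disabled({"spark.task.cpus": "1"})
```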
+} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-cannot-access-a-notebook-in-github.json b/scraped_kb_articles/databricks-cannot-access-a-notebook-in-github.json new file mode 100644 index 0000000000000000000000000000000000000000..e03571e602540a2be7d08b8acbca4dd2fb75644a --- /dev/null +++ b/scraped_kb_articles/databricks-cannot-access-a-notebook-in-github.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/databricks-cannot-access-a-notebook-in-github", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Databricks job may fail to access a notebook in a GitHub repository after previously being able to.\nUnable to access the notebook 'resources/notebooks/examplename'. Either it does not exist, or the identity used to run this job, (), lacks the required permissions.\nCause\nThere are two possible causes.\nThe first is that the notebook was modified from a Databricks notebook to a standard Python script and is missing the necessary notebook identifier.\nThe second is an issue with the GitHub credentials or permissions associated with the service principal or user running the job.\nSolution\nIf the cause relates to file type\nIf your\n.py\nfile is intended to function as a notebook, you must ensure that the\nexamplename.py\nfile in the GitHub repository starts with the line\n# Databricks notebook source\n.\nFor more information, please review the\nExport and import Databricks notebooks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you intend the file to be a standard Python script, update the job configuration in Databricks to treat the file as a Python script.
Change the task type from\nNotebook\nto\nPython script\nand include the\n.py\nextension in the file path.\nFor more information, please refer to the\nUse version-controlled source code in a Databricks job\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf the cause relates to GitHub credentials or permissions\nVerify that the GitHub credentials and permissions for the service principal or user running the job are correctly configured. For more information, please review the\nConfigure Git credentials & connect a remote repo to Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nPlease also refer to the\nManage file assets in Databricks Git folders\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nService principals for CI/CD\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-connect-job-fails-after-a-databricks-runtime-update.json b/scraped_kb_articles/databricks-connect-job-fails-after-a-databricks-runtime-update.json new file mode 100644 index 0000000000000000000000000000000000000000..40fce481d2cf1b4d75035ccbafe44e88e705e6c2 --- /dev/null +++ b/scraped_kb_articles/databricks-connect-job-fails-after-a-databricks-runtime-update.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/databricks-connect-job-fails-after-a-databricks-runtime-update", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour legacy Databricks Connect jobs start failing with a\njava.lang.ClassCastException\nerror message. 
The error is not associated with any specific commands but seems to affect multiple Databricks Connect commands or jobs.\nCaused by: java.lang.ClassCastException: cannot assign instance of org.apache.spark.sql.catalyst.trees.TreePattern$ to field org.apache.spark.sql.catalyst.trees.TreePattern$.WITH_WINDOW_DEFINITION of type scala.Enumeration$Value in instance of org.apache.spark.sql.catalyst.trees.TreePattern$at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)\nCause\nYour cluster is running the latest maintenance release for your chosen Databricks Runtime, but you did not update the version of Databricks Connect you are using to connect to the cluster.\nDatabricks Connect requires the client version to match the Databricks Runtime version on your compute cluster.\nIf the Databricks Connect client version does not correspond to the Databricks Runtime version on your cluster, you may get an error message.\nSolution\nThe Databricks Connect package must be kept in sync with the corresponding Databricks Runtime release.\nReview the\nDatabricks Connect release notes\n(\nAWS\n|\nAzure\n|\nGCP\n) to determine the correct version to use with your selected Databricks Runtime." 
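A quick local sanity check can catch this mismatch before jobs fail. The sketch below is a hypothetical helper, not part of Databricks Connect: it compares the client's major.minor against the cluster's runtime version, which is the rule legacy Databricks Connect enforces. Maintenance-release caveats can still apply, so always confirm against the release notes.

```python
def client_matches_runtime(client_version: str, runtime_version: str) -> bool:
    """Legacy Databricks Connect requires the client's major.minor
    (e.g. '9.1.24' -> 9.1) to match the cluster's runtime (e.g. '9.1')."""
    def major_minor(version: str):
        return tuple(int(part) for part in version.split(".")[:2])
    return major_minor(client_version) == major_minor(runtime_version)

print(client_matches_runtime("9.1.24", "9.1"))  # client kept in sync
print(client_matches_runtime("7.3.5", "9.1"))   # stale client after a runtime update
```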
+} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-dashboards-deployed-with-databricks-asset-bundles-dab-are-duplicated-on-deployment.json b/scraped_kb_articles/databricks-dashboards-deployed-with-databricks-asset-bundles-dab-are-duplicated-on-deployment.json new file mode 100644 index 0000000000000000000000000000000000000000..cf8565722bbf002aa7dead2d256748858ce1429c --- /dev/null +++ b/scraped_kb_articles/databricks-dashboards-deployed-with-databricks-asset-bundles-dab-are-duplicated-on-deployment.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/databricks-dashboards-deployed-with-databricks-asset-bundles-dab-are-duplicated-on-deployment", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you deploy new dashboards using Databricks Asset Bundles (DABs), you notice a duplicate dashboard is created with the name of the JSON definition file. The duplicate dashboard has the same charts as the primary dashboard, but all visuals show\n\"Unable to render visual\"\n.\nCause and Solution\nRefer to the databricks/cli\nDuplicate dashboards deployed #2910\nGithub issue for an explanation and solution.\nThe solution in the Github issue links to official Databricks documentation, the “sync” section of the\nDatabricks Asset Bundle configuration\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
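Since the fix in the GitHub issue points at the bundle's sync configuration, the usual shape of the workaround is to exclude the raw dashboard JSON from workspace file sync so only the dashboard resource is deployed. A hedged sketch of the relevant databricks.yml fragment — the bundle name, paths, and warehouse variable are illustrative placeholders:

```yaml
bundle:
  name: my_bundle  # placeholder

# Keep the raw dashboard definition out of workspace file sync,
# so only the dashboard resource below is created on deploy.
sync:
  exclude:
    - src/dashboards/*.lvdash.json

resources:
  dashboards:
    sales_dashboard:
      display_name: Sales Dashboard
      file_path: ./src/dashboards/sales.lvdash.json
      warehouse_id: ${var.warehouse_id}
```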
+} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-runtime-is-not-able-to-read-data-in-a-format-other-than-delta.json b/scraped_kb_articles/databricks-runtime-is-not-able-to-read-data-in-a-format-other-than-delta.json new file mode 100644 index 0000000000000000000000000000000000000000..d1841f8221d390d4d839815cced016bf96254182 --- /dev/null +++ b/scraped_kb_articles/databricks-runtime-is-not-able-to-read-data-in-a-format-other-than-delta.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/databricks-runtime-is-not-able-to-read-data-in-a-format-other-than-delta", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you attempt to read from a specified path using the\nformat(\"parquet”)\n,\nformat(“cloudfiles”)\n, or any non-Delta text format, or if you attempt to query a non-Delta table from DBSQL, you receive the following error message.\nError Stack trace:\r\n\"A transaction log for Delta was found at //_delta_log, but you are trying to read from /Volumes/// using format(). You must use 'format(\"delta\")' when reading and writing to a delta table.\"\nCause\nA Delta table is present at the root level, and the query is trying to read from a subdirectory using your indicated\nformat\noption.\nWhen a Delta table is created, it generates a transaction log at the\n_delta_log\nlocation. When a query is executed, it checks if the target location is a Delta table or not. If a Delta table is found, it expects the\nformat(\"delta\")\noption to be used. The presence of a Delta table at the root level can occur when a user modifies the path or creates a Delta table at the root level unintentionally. 
This can lead to queries failing with the error message in the problem statement.\nSolution\nIf you use serverless, either refine the table’s definitions so each table points to a distinct source path, or remove the Delta table at the root level by deleting the\n_delta_log\nfolder.\ndbutils.fs.rm(\"dbfs:/_delta_log/\", True)\nImportant\nBefore making any changes, ensure you have a backup of your data to avoid any data loss.\nIf you are unable to delete the\n_delta_log\nfolder, you can instead move the transaction log to any different folder.\nIf you are not using serverless, you can disable a table’s format assumption for the engine when reading it. Set the following configuration to tell Apache Spark to ignore the formats when reading the data.\n`spark.databricks.delta.formatCheck.enabled=false`\nTo avoid similar issues in the future, follow these best practices.\nVerify the path and format options used in your queries to avoid conflicts with existing Delta tables.\nRegularly review your data structure and clean up files and folders no longer needed.\nAvoid creating different tables which point to the same source path." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-spark-submit-jobs-appear-to-%E2%80%9Chang%E2%80%9D-and-clusters-do-not-auto-terminate-.json b/scraped_kb_articles/databricks-spark-submit-jobs-appear-to-%E2%80%9Chang%E2%80%9D-and-clusters-do-not-auto-terminate-.json new file mode 100644 index 0000000000000000000000000000000000000000..8cabae4bc9094c2ef94e55632fce545fc1aab1a6 --- /dev/null +++ b/scraped_kb_articles/databricks-spark-submit-jobs-appear-to-%E2%80%9Chang%E2%80%9D-and-clusters-do-not-auto-terminate-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/databricks-spark-submit-jobs-appear-to-%E2%80%9Chang%E2%80%9D-and-clusters-do-not-auto-terminate-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDatabricks\nspark-submit\njobs appear to “hang,” either after the user class’s main method completes (regardless of success or failure, for Java / Scala jobs), or upon a Python script exit (for Python jobs). Clusters do not auto-terminate in this case.\nCause\nFor Java, Scala, and Python programs, Spark does not automatically call\nSystem.exit\n.\nFor context, the Java Virtual Machine initiates the shutdown sequence in response to one of three events.\nWhen the number of\nlive\nnon-daemon threads drops to zero for the first time.\nWhen the\nRuntime.exit\nor\nSystem.exit\nmethod is called for the first time.\nWhen an external event occurs such as an interrupt, or a signal is received from the operating system.\nSolution\nEmbed\nsystem.exit\ncode in your application to shutdown the Java virtual machine with exit code 0.\nExamples\nPython\nimport sys\r\nsc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain handle to SparkContext\r\nrunTheRestOfTheUserCode()\r\n# Fall through to exit with code 0 in case of success, since failure will throw an uncaught exception\r\n# and won't reach the exit(0) and thus will trigger a non-zero exit code that will be handled by\r\n# 
PythonRunner\r\nsc._gateway.jvm.System.exit(0)\nScala\ndef main(args: Array[String]): Unit = {\r\n try {\r\n    runTheRestOfTheUserCode()\r\n } catch {\r\n    case t: Throwable =>\r\n      try {\r\n        // Log the throwable or error here\r\n      } finally {\r\n        System.exit(1)\r\n      }\r\n }\r\n System.exit(0)\r\n}\nLong-term fix\nYou can also track Spark’s long-term fix at\nSPARK-48547\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-sql-python-package-fails-with-self-signed-certificate-errors-and-code-_sslc1006.json b/scraped_kb_articles/databricks-sql-python-package-fails-with-self-signed-certificate-errors-and-code-_sslc1006.json new file mode 100644 index 0000000000000000000000000000000000000000..6cc7678f8797bb821cf8f08f042c1f1fcb0ceedc --- /dev/null +++ b/scraped_kb_articles/databricks-sql-python-package-fails-with-self-signed-certificate-errors-and-code-_sslc1006.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/databricks-sql-python-package-fails-with-self-signed-certificate-errors-and-code-_sslc1006", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to connect using the\ndatabricks-sql-python\npackage from your local Python environment or using development tools like VSCode or PyCharm, you receive a certificate error.\nMaxRetryError (note: full exception trace is shown but execution 1is paused at: _run_module_as_main)\r\nHTTPSConnectionPool(host='my-workspace.cloud.databricks.com', port=443): Max retries exceeded with url: /sql/protocolv1/o/workspaceID/xxxx-xxxxxx-xxxxxxxx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)')))\nCause\nThe custom certificate (self-signed certificate)\nwithin your local Python environment does not exist, or the connection cuts off when trying to communicate with the Databricks endpoint.\nSolution\nMake sure the 
current Python self-certificate used with the\ndatabricks-sql-python\npackage is valid and updated. You may need to first confirm the details for your particular OS environment and verify with your IT Team that such certificates are valid within your network.\nFor more information on certificates, please refer to the Python\ncertifi 2024.8.30\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-sql-warehouse-query-finish-with-cache-is-not-populating-open-in-spark-ui-option.json b/scraped_kb_articles/databricks-sql-warehouse-query-finish-with-cache-is-not-populating-open-in-spark-ui-option.json new file mode 100644 index 0000000000000000000000000000000000000000..1ac40b24f39a02f09fcecf57f7844ac9988ef8d9 --- /dev/null +++ b/scraped_kb_articles/databricks-sql-warehouse-query-finish-with-cache-is-not-populating-open-in-spark-ui-option.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/databricks-sql-warehouse-query-finish-with-cache-is-not-populating-open-in-spark-ui-option", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Databricks SQL, typically when queries finish you can click the kebab menu and choose\nOpen in Spark UI\nfrom the dropdown menu. However, for queries that finish with cache, you notice the\nOpen in Spark UI\noption is missing from the menu.\nThe following image shows the menu for a query without cache.\nConversely, the following image shows the menu for a query with cache, missing the\nOpen in Spark UI\noption.\nCause\nThe Spark UI displays information about Spark jobs and stages. When a query uses 100% of the cache, there are no corresponding Apache Spark jobs or stages launched, so there is no relevant information to display. 
The menu option\nOpen in Spark UI\nis not applicable.\nSolution\nTo access the Spark UI for the corresponding warehouse instead:\nSelect any query that has run and triggered Spark jobs or stages.\nUse that query's URL page to access the Spark UI." +} \ No newline at end of file diff --git a/scraped_kb_articles/databricks-support-for-ipv6.json b/scraped_kb_articles/databricks-support-for-ipv6.json new file mode 100644 index 0000000000000000000000000000000000000000..81a8abb630d59946ceaf3c6cf0dede6553c46eef --- /dev/null +++ b/scraped_kb_articles/databricks-support-for-ipv6.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/databricks-support-for-ipv6", + "title": "Título do Artigo Desconhecido", + "content": "IPv6 is not currently supported." +} \ No newline at end of file diff --git a/scraped_kb_articles/dataframe-in-an-interactive-cluster-still-showing-cached-data-after-calling-unpersist-function.json b/scraped_kb_articles/dataframe-in-an-interactive-cluster-still-showing-cached-data-after-calling-unpersist-function.json new file mode 100644 index 0000000000000000000000000000000000000000..4e619ef22417f1de35bfd0744320be2d2674aaa4 --- /dev/null +++ b/scraped_kb_articles/dataframe-in-an-interactive-cluster-still-showing-cached-data-after-calling-unpersist-function.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/dataframe-in-an-interactive-cluster-still-showing-cached-data-after-calling-unpersist-function", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn an interactive cluster, you notice when you unpersist DataFrames cached in\ndf.cache()\nby calling\ndf.unpersist()\n, the old data continues to be referenced.\nCause\nThe\nunpersist()\nfunction is a non-blocking call by default. 
The following function marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.\nDataFrame.unpersist(blocking: bool = False) → pyspark.sql.dataframe.DataFrame\nHowever, both\nunpersist()\nand\nunpersist(blocking = False)\nstill allow other operations until memory pressure forces block deletion.\nSolution\nIn your code, set\nunpersist(blocking=True)\nto ensure the\nunpersist()\noperation completes before proceeding with further actions. This will ensure that the cached data is fully cleared from memory before any subsequent actions are performed.\nFor more information, refer to the\npyspark.sql.DataFrame.unpersist\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/dataframe-to-ray-dataset-conversions-taking-a-long-time-to-execute.json b/scraped_kb_articles/dataframe-to-ray-dataset-conversions-taking-a-long-time-to-execute.json new file mode 100644 index 0000000000000000000000000000000000000000..640298c84d446ce43c0c8a98399428eb2fd1bf28 --- /dev/null +++ b/scraped_kb_articles/dataframe-to-ray-dataset-conversions-taking-a-long-time-to-execute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/dataframe-to-ray-dataset-conversions-taking-a-long-time-to-execute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen converting a DataFrame to a Ray Dataset using Ray versions below 2.10, you notice long execution times, especially for small datasets.\nCause\nEvery Apache Spark job requests one GPU per Spark task, but the GPU is already occupied by the Ray worker node. This makes the Spark job hang, since the Spark task resources can't be allocated.\nThe Ray cluster alternatively may be using all available vCPU resources on the workers, so there are no free vCPU cores for the Spark tasks.\nSolution\nSet the Spark configuration\nspark.task.resource.gpu.amount\nto\n0\nin your cluster’s configuration. 
This means no GPU resources will be allocated to the Spark tasks while the Ray cluster uses the GPU.\nIn your workspace, navigate to\nCompute\n.\nClick your cluster to open the settings.\nScroll down to\nAdvanced options\nand click to expand.\nIn the\nSpark\ntab, enter\nspark.task.resource.gpu.amount 0\nin the\nSpark config\nbox.\nAlternatively, update your Ray cluster configuration to modify\nnum_cpus_worker_node\nto be less than the total number of CPUs available on each worker type. For example, if you have both\nnum_worker_nodes\nand\nnum_cpus_worker_node\nset to four, reduce\nnum_cpus_worker_node\nto three.\nLast, you can enable Spark cluster auto-scaling by setting\nautoscale=True\nin the\nsetup_ray_cluster\nfunction. If you are configuring both the Databricks and Ray-on-Spark clusters to be auto-scalable for better resource management, please refer to the\nScale Ray clusters on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/datatype_mismatchcast_without_suggestion-error-when-querying-views-of-system-tables.json b/scraped_kb_articles/datatype_mismatchcast_without_suggestion-error-when-querying-views-of-system-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..5ec4becb1b94c4dbd7a099f2a163487a54e53da3 --- /dev/null +++ b/scraped_kb_articles/datatype_mismatchcast_without_suggestion-error-when-querying-views-of-system-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/datatype_mismatchcast_without_suggestion-error-when-querying-views-of-system-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are querying views created on top of system tables, such as the billing usage table, when you get a schema mismatch error message.\nThe error typically manifests as a\nDATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION\nerror, indicating that certain fields in the view do not match the current table schema.\nCause\nViews created
on top of tables do not support schema evolution by default. When the underlying table's schema is updated (for example, new fields are added), the view's schema remains static, leading to a mismatch between the view and the table it references. This mismatch causes errors when querying the view, as it attempts to cast data to an outdated schema.\nSolution\nUpdate to Databricks Runtime 15.4 LTS or above to use schema evolution for views. For details, refer to the\nCREATE VIEW\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nThen, when creating new views or recreating existing ones, use the\nWITH SCHEMA EVOLUTION\nclause. This ensures that the view's schema automatically updates when changes occur in the underlying table. You can use the following example code.\n%sql\r\n\r\nCREATE VIEW .. WITH SCHEMA EVOLUTION AS\r\nSELECT * FROM WHERE \nDatabricks recommends prioritizing recreating views on frequently updated tables or those crucial to your operations. For views created before you implemented schema evolution, be sure to use the\nWITH SCHEMA EVOLUTION\nclause when recreating them to enable automatic schema updates." 
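For a concrete shape of the command above — the catalog, schema, view name, and filter are illustrative placeholders, and `system.billing.usage` is the billing system table the article mentions:

```sql
-- Recreate a reporting view so its schema tracks the underlying
-- system table as new columns are added.
CREATE OR REPLACE VIEW main.reporting.billing_usage_v
WITH SCHEMA EVOLUTION AS
SELECT *
FROM system.billing.usage
WHERE usage_date >= DATE'2024-01-01';
```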
+} \ No newline at end of file diff --git a/scraped_kb_articles/date-int-only-spark-30.json b/scraped_kb_articles/date-int-only-spark-30.json new file mode 100644 index 0000000000000000000000000000000000000000..064c9a627fdd0f993b459c7adcc7e7d18cb4aa5d --- /dev/null +++ b/scraped_kb_articles/date-int-only-spark-30.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/date-int-only-spark-30", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to use the\ndate_add()\nor\ndate_sub()\nfunctions in Spark 3.0, but they are returning an\nError in SQL statement: AnalysisException\nerror message.\nIn Spark 2.4 and below, both functions work as normal.\n%sql\r\n\r\nselect date_add(cast('1964-05-23' as date), '12.34')\nCause\nYou are attempting to use a fractional or string value as the second argument.\nIn Spark 2.4 and below, if the second argument is a fractional or string value, it is coerced to an\nint\nvalue before\ndate_add()\nor\ndate_sub()\nis evaluated.\nUsing the example code listed above, the value 12.34 is converted to\n12\nbefore\ndate_add()\nis evaluated.\nIn Spark 3.0, if the second argument is a fractional or string value, it returns an error.\nSolution\nUse\nint\n,\nsmallint\n, or\ntinyint\nvalues as the second argument for the\ndate_add()\nor\ndate_sub()\nfunctions in Spark 3.0.\n%sql\r\n\r\nselect date_add(cast('1964-05-23' as date), '12')\n%sql\r\n\r\nselect date_add(cast('1964-05-23' as date), 12)\nBoth of these examples work properly in Spark 3.0.\nInfo\nIf you are importing this data from another source, you should create a routine to sanitize the values and ensure the data is in integer form before passing it to one of the date functions."
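The sanitizing routine the Info note suggests can be as small as one coercion, mimicking the implicit cast Spark 2.4 applied (truncation toward zero). The function name below is illustrative:

```python
def to_int_days(value) -> int:
    """Coerce a fractional or string day count to int, the way
    Spark 2.4 implicitly cast it before date_add()/date_sub().
    int(float(...)) truncates toward zero, matching a SQL cast."""
    return int(float(value))

# to_int_days("12.34") == 12; pass the result to date_add()/date_sub().
```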
+} \ No newline at end of file diff --git a/scraped_kb_articles/dbcli-win-fail-create-process.json b/scraped_kb_articles/dbcli-win-fail-create-process.json new file mode 100644 index 0000000000000000000000000000000000000000..44ac9c71203c4f892c983229327d8735008a37d6 --- /dev/null +++ b/scraped_kb_articles/dbcli-win-fail-create-process.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/dbcli-win-fail-create-process", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile trying to access the Databricks CLI (\nAWS\n|\nAzure\n|\nGCP\n) in Windows, you get a\nfailed to create process\nerror message.\nCause\nThis can happen:\nIf multiple instances of the Databricks CLI are installed on the system.\nIf the Python path on your Windows system includes a space.\nInfo\nThere is a\nknown issue in pip\nwhich causes pip installed software to fail if there is a space in your Python path.\nSolution\nEnsure that you do not have multiple instances of the Databricks CLI installed by running\nwhere databricks\n.\nIf you do have multiple instances installed, delete all instances except the one in the user profile path.\nEnsure that Python is installed to a path without spaces, or ensure that you have enclosed the path in quotes when it is referenced on the first line of any script in the\n\\Scripts\ndirectory.\nIf the first line of your script looks like this, it will fail:\n#!c:\\program files\\python\\python38\\python.exe\nIf the first line of your script looks like this, it will work correctly:\n#!\"c:\\program files\\python\\python38\\python.exe\"" +} \ No newline at end of file diff --git a/scraped_kb_articles/dbconnect-incompatible-dbr64.json b/scraped_kb_articles/dbconnect-incompatible-dbr64.json new file mode 100644 index 0000000000000000000000000000000000000000..676d32e139f7c5d90326e8cbacf0d077ac297974 --- /dev/null +++ b/scraped_kb_articles/dbconnect-incompatible-dbr64.json @@ -0,0 +1,5 @@ +{ + "url":
"https://kb.databricks.com/en_US/dev-tools/dbconnect-incompatible-dbr64", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using the Databricks Connect client with Databricks Runtime 6.4 and receive an error message which states that the client does not support the cluster.\nCaused by: java.lang.IllegalArgumentException: The cluster is running server version `dbr-6.4` but this client only supports Set(dbr-5.5). You can find a list of client releases at https://pypi.org/project/databricks-connect/#history, and install the right client version with `pip install -U databricks-connect==`. For example, to install the latest 5.1 release, use `pip install -U databricks-connect==5.1.*`. To ignore this error and continue, set DEBUG_IGNORE_VERSION_MISMATCH=1.\nCause\nImprovements were made to Databricks Runtime 6.4 which are incompatible with the Databricks Connect client 6.4.1 and below.\nSolution\nUpgrade the Databricks Connect client to 6.4.2.\nFollow the documentation to set up the client (\nAWS\n|\nAzure\n) on your local workstation, making sure to set the\ndatabricks-connect\nvalue to 6.4.2.\npip install databricks-connect==6.4.2\nWarning\nSetting\nDEBUG_IGNORE_VERSION_MISMATCH=1\nis not recommended, as it does not resolve the underlying compatibility issues."
+} \ No newline at end of file diff --git a/scraped_kb_articles/dbconnect-protoserializer-stackoverflow.json b/scraped_kb_articles/dbconnect-protoserializer-stackoverflow.json new file mode 100644 index 0000000000000000000000000000000000000000..8c15d208ef39f833b794d0a64c55617dfc73034c --- /dev/null +++ b/scraped_kb_articles/dbconnect-protoserializer-stackoverflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/dbconnect-protoserializer-stackoverflow", + "title": "Unknown Article Title", + "content": "Problem\nYou are using DBConnect (\nAWS\n|\nAzure\n|\nGCP\n) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error.\npy4j.protocol.Py4JJavaError: An error occurred while calling o945.count.\r\n: java.lang.StackOverflowError\r\n    at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)\r\n    at java.lang.Class.getEnclosingClass(Class.java:1272)\r\n    at java.lang.Class.getSimpleBinaryName(Class.java:1443)\r\n    at java.lang.Class.getSimpleName(Class.java:1309)\r\n    at org.apache.spark.sql.types.DataType.typeName(DataType.scala:67)\r\n    at org.apache.spark.sql.types.DataType.simpleString(DataType.scala:82)\r\n    at org.apache.spark.sql.types.DataType.sql(DataType.scala:90)\r\n    at org.apache.spark.sql.util.ProtoSerializer.serializeDataType(ProtoSerializer.scala:3207)\r\n    at org.apache.spark.sql.util.ProtoSerializer.serializeAttrRef(ProtoSerializer.scala:3610)\r\n    at org.apache.spark.sql.util.ProtoSerializer.serializeAttr(ProtoSerializer.scala:3600)\r\n    at org.apache.spark.sql.util.ProtoSerializer.serializeNamedExpr(ProtoSerializer.scala:3537)\r\n    at org.apache.spark.sql.util.ProtoSerializer.serializeExpr(ProtoSerializer.scala:2323)\r\n    at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:3001)\r\n    at 
org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:2998)\nPerforming the same operation in a notebook works correctly and does not produce an error.\nExample code\nYou can reproduce the error with this sample code.\nIt creates a DataFrame with 200 columns and renames them all.\nThis sample code runs correctly in a notebook, but results in an error when run in DBConnect.\n%python\r\n\r\ndf = spark.createDataFrame([{str(i) : i for i in range(200)}])\r\nfor col in df.columns:\r\n  df = df.withColumnRenamed(col, col + \"_a\")\r\ndf.collect()\nCause\nWhen you run code in DBConnect, some functions are handled on the remote cluster driver, but some are handled locally on the client PC.\nIf insufficient memory is allocated on the local PC, you get an error.\nSolution\nYou should increase the memory allocated to the Apache Spark driver on the local PC.\nRun\ndatabricks-connect get-spark-home\non your local PC to get the\n${spark_home}\nvalue.\nNavigate to the\n${spark_home}/conf/\nfolder.\nOpen the\nspark-defaults.conf\nfile.\nAdd the following settings to the\nspark-defaults.conf\nfile:\nspark.driver.memory 4g\r\nspark.driver.extraJavaOptions -Xss32M\nSave the changes.\nRestart DBConnect.\nWarning\nDBConnect only works with supported Databricks Runtime versions. Ensure that you are using a supported runtime on your cluster before using DBConnect." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/dbconnect-spark-session-null.json b/scraped_kb_articles/dbconnect-spark-session-null.json new file mode 100644 index 0000000000000000000000000000000000000000..80ed02263a9ef3d376722b221c5fd7035c8372f9 --- /dev/null +++ b/scraped_kb_articles/dbconnect-spark-session-null.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/dbconnect-spark-session-null", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to run your code using Databricks Connect\n(\nAWS\n|\nAzure\n|\nGCP\n)\nwhen you get a\nsparkSession is null\nerror message.\njava.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult\r\nat scala.Predef$.assert(Predef.scala:170)\r\nat org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:323)\r\nat org.apache.spark.sql.Dataset$$anonfun$50.apply(Dataset.scala:3351)\r\nat org.apache.spark.sql.Dataset$$anonfun$50.apply(Dataset.scala:3350)\r\nat org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3485)\r\nat org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3480)\r\nat org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)\r\nat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)\r\nat org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)\r\nat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)\r\nat org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3480)\r\nat org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3350)\r\nat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\r\nat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\r\nat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\r\nat 
java.lang.reflect.Method.invoke(Method.java:498)\r\nat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\r\nat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)\r\nat py4j.Gateway.invoke(Gateway.java:295)\r\nat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\r\nat py4j.commands.CallCommand.execute(CallCommand.java:79)\r\nat py4j.GatewayConnection.run(GatewayConnection.java:251)\r\nat java.lang.Thread.run(Thread.java:748)\nCause\nYou get the\nsparkSession is null\nerror message if a Spark session is not active on your cluster when you try to run your code using DBConnect.\nSolution\nYou must ensure that a Spark session is active on your cluster before you attempt to run your code locally using DBConnect.\nYou can use the following Python example code to check for a Spark session and create one if it does not exist.\n%python\r\n\r\nfrom pyspark.sql import SparkSession\r\nspark = SparkSession.builder.getOrCreate()\nWarning\nDBConnect only works with supported Databricks Runtime versions. Ensure that you are using a supported runtime on your cluster before using DBConnect." 
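For readers unfamiliar with getOrCreate(), the pattern it implements can be sketched in plain Python. FakeSession is a stand-in for a real SparkSession, so this is a sketch of the get-or-create pattern itself, not of PySpark internals:

```python
# Pure-Python sketch of the get-or-create pattern that
# SparkSession.builder.getOrCreate() follows: reuse the active session when
# one exists, and only build a new one otherwise. FakeSession is a stand-in.
_active = None

class FakeSession:
    pass

def get_or_create():
    global _active
    if _active is None:      # no active session: build one
        _active = FakeSession()
    return _active           # otherwise reuse the existing session

s1 = get_or_create()
s2 = get_or_create()
print(s1 is s2)  # True: repeated calls return the same session
```

This is why it is safe to put the getOrCreate() call at the top of every script you run via DBConnect.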
+} \ No newline at end of file diff --git a/scraped_kb_articles/dbfs-file-size-limit.json b/scraped_kb_articles/dbfs-file-size-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..695a414f8b1b6c6b35f39fe61f27566f58a9a606 --- /dev/null +++ b/scraped_kb_articles/dbfs-file-size-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/dbfs-file-size-limit", + "title": "Unknown Article Title", + "content": "This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs.\nProblem\nIf you mount a folder onto\ndbfs://\nand read a file larger than 2GB in a Python API like pandas, you will see the following error:\n/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)()\r\n/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6883)()\r\nIOError: Initializing from file failed\nCause\nThe error occurs because the Python method used to read the file takes the file length as a signed int. If the file is larger than 2GB, its length exceeds the maximum value a signed int can hold.\nSolution\nMove the file from\ndbfs://\nto the local file system (\nfile://\n). Then read using the Python API. 
For example:\nCopy the file from\ndbfs://\nto\nfile://\n:\n%fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv\nRead the file in the\npandas\nAPI:\n%python\r\n\r\nimport pandas as pd\r\npd.read_csv('file:/tmp/large_file.csv').head()" +} \ No newline at end of file diff --git a/scraped_kb_articles/dbfs-init-script-detection-notebook.json b/scraped_kb_articles/dbfs-init-script-detection-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..9727e14c11bf6f5609d6e4fae6eb78be2f5b1a7e --- /dev/null +++ b/scraped_kb_articles/dbfs-init-script-detection-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/dbfs-init-script-detection-notebook", + "title": "Unknown Article Title", + "content": "On Dec 1, 2023, init scripts stored on DBFS (including legacy global init scripts and cluster-named init scripts) reached End-of-Life.\nDatabricks recommends you migrate any init scripts stored on DBFS to a supported type as soon as possible.\nReview the\nRecommendations for init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information on supported init script types and locations.\nDatabricks Engineering has created a notebook to help you detect init scripts stored on DBFS across clusters, cluster policies, jobs, and DLT pipelines in your workspace.\nInstructions\nWarning\nYou must be a Databricks admin to run this notebook.\nPrerequisites\nBefore running this notebook, you should complete the following:\nHave a cluster running Databricks Runtime 13.3 LTS.\nImport the\nhelpers.zip\npackage into your workspace.\nVerify that the\n/helpers\nfolder exists and contains the files\n__init__.py\nand\nchecks.py\n.\nInfo\nBy default, this notebook checks the current workspace. 
If you want to check additional workspaces, you will need the following:\nWorkspace URL for each workspace you want to test.\nAdministrator PAT (\nAWS\n|\nAzure\n|\nGCP\n) (personal access token) for each workspace you want to test.\nDetect init scripts stored on DBFS\nDownload the\nDBFS init script detection notebook\n.\nImport the notebook to your workspace.\nEnsure the notebook is in the root of your workspace storage. It should not be in the\n/helpers\nfolder.\nStart a cluster with Databricks Runtime 13.3 LTS.\nRun the notebook.\nOnce the notebook finishes running, it returns a list of init scripts stored on DBFS in your workspace.\nIf there are no init scripts stored on DBFS in your workspace, the notebook returns all of the following messages:\nNo clusters with init scripts on DBFS\nNo clusters with named init scripts on DBFS\nNo jobs with init scripts on DBFS\nThere are no DLT pipelines with init scripts on DBFS\nThere are no cluster policies with references to init scripts on DBFS\nThere are no init scripts that reference files on DBFS\nCheck other workspaces for init scripts stored on DBFS\nAfter running the detection script once in your current workspace, a widget is visible at the top of the notebook.\nEnter the\nPAT\nfor the workspace you want to check into the widget.\nEnter the\nWorkspace URL\nfor the workspace you want to check into the widget.\nRe-run the notebook.\nThe results displayed apply to the workspace specified in the widget.\nMigrate your init scripts\nIf the DBFS init script detection notebook detects init scripts, you should review the\nMigrate init scripts from DBFS\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for further details." 
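The per-cluster check the notebook performs can be approximated as follows. The init_scripts/dbfs/destination field names follow the Databricks Clusters API response shape; the sample cluster spec itself is invented for illustration:

```python
# Approximate sketch of the notebook's per-cluster check: flag init scripts
# whose destination lives on DBFS. The "init_scripts"/"dbfs"/"destination"
# field names follow the Databricks Clusters API; the sample spec is invented.
def dbfs_init_scripts(cluster_spec: dict) -> list:
    flagged = []
    for script in cluster_spec.get("init_scripts", []):
        dest = script.get("dbfs", {}).get("destination", "")
        if dest.startswith("dbfs:/"):
            flagged.append(dest)
    return flagged

sample = {
    "cluster_name": "etl-cluster",
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/init/legacy.sh"}},
        {"workspace": {"destination": "/Workspace/init/migrated.sh"}},
    ],
}
print(dbfs_init_scripts(sample))  # ['dbfs:/databricks/init/legacy.sh']
```

The real notebook applies the same kind of filter across clusters, cluster policies, jobs, and DLT pipelines.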
+} \ No newline at end of file diff --git a/scraped_kb_articles/dbfs-root-permissions.json b/scraped_kb_articles/dbfs-root-permissions.json new file mode 100644 index 0000000000000000000000000000000000000000..248497ecb4db100c21d8571d1d9084e643080c47 --- /dev/null +++ b/scraped_kb_articles/dbfs-root-permissions.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/dbfs-root-permissions", + "title": "Unknown Article Title", + "content": "Problem\nAn\nAccess Denied\nerror is returned when you attempt to read Databricks objects stored in the DBFS root directory in blob storage from outside a Databricks cluster.\nCause\nThis is normal behavior for the DBFS root directory. Databricks stores objects like libraries and other temporary system files in the DBFS root directory. Databricks is the only user that can read these objects.\nSolution\nDatabricks does not recommend using the root directory for storing any user files or objects. Instead, create a different blob storage directory and mount it to DBFS." +} \ No newline at end of file diff --git a/scraped_kb_articles/decreased-performance-when-using-delete-with-a-subquery-on-databricks-runtime-104-lts.json b/scraped_kb_articles/decreased-performance-when-using-delete-with-a-subquery-on-databricks-runtime-104-lts.json new file mode 100644 index 0000000000000000000000000000000000000000..a419685e150c68f46e91aebe4e5adc1292c7ae93 --- /dev/null +++ b/scraped_kb_articles/decreased-performance-when-using-delete-with-a-subquery-on-databricks-runtime-104-lts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/decreased-performance-when-using-delete-with-a-subquery-on-databricks-runtime-104-lts", + "title": "Unknown Article Title", + "content": "Problem\nAuto optimize on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) is an optional set of features that automatically compact small files during individual writes to a Delta table. 
Paying a small cost during writes offers significant benefits for tables that are queried actively.\nAlthough auto optimize can be beneficial in many situations, you can see decreased performance on Databricks Runtime 10.4 LTS when you have a\nDELETE\nwith a subquery where one side is small enough to be broadcast.\nFor instance, the query may look as follows:\nDELETE FROM  \r\nWHERE Date = <'SampleDate'>\r\nAND SampleID IN (\r\n        SELECT MatchId FROM WHERE MatchId = 'Value')\nCause\nOptimized writes are enabled by default for\nDELETE\nwith a subquery, on Databricks Runtime 10.4 LTS, on the assumption that the data will be shuffled. In situations where one side is small enough to be broadcast, this does not happen and you may see a performance hit.\nSolution\nIf you encounter this issue, and you do not want to upgrade to a newer Databricks Runtime, you should disable auto optimize in your Delta table by setting\ndelta.autoOptimize.optimizeWrite = false\nin the table properties.\nYou should also set this value in your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n):\nspark.databricks.delta.delete.forceOptimizedWrites = false\nInfo\nDatabricks Runtime 11.2 and above disables auto optimized writes for\nDELETE\nwith subqueries by default." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/delegating-principal-error-on-published-dashboard.json b/scraped_kb_articles/delegating-principal-error-on-published-dashboard.json new file mode 100644 index 0000000000000000000000000000000000000000..2bf620693f5fc0c488a406eb8ab76c5cc0ff3f51 --- /dev/null +++ b/scraped_kb_articles/delegating-principal-error-on-published-dashboard.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/visualizations/delegating-principal-error-on-published-dashboard", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to access a published dashboard, you receive the following error.\nFailed to execute query: Delegating principal is not a member of workspace \nCause\nThe dashboard is configured to be viewed with the publisher's credentials (referred to as embedded credentials) but the publisher has been removed from the workspace. Their credentials are no longer valid, so the query execution fails.\nSolution\nHave a current active user in the workspace republish the dashboard with embedded credentials. This regenerates the access credentials tied to a valid workspace user, allowing the dashboard to function correctly.\nFor details on embedded credentials, refer to the\nShare a dashboard\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/delete-checkpoint-restart.json b/scraped_kb_articles/delete-checkpoint-restart.json new file mode 100644 index 0000000000000000000000000000000000000000..dff6d0b43b5860d8e7be9be0ed288e20308572f6 --- /dev/null +++ b/scraped_kb_articles/delete-checkpoint-restart.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delete-checkpoint-restart", + "title": "Unknown Article Title", + "content": "Problem\nYour job fails with a\nDelta table doesn't exist. 
Please delete your streaming query checkpoint and restart.\nerror message.\nCause\nTwo different streaming sources are configured to use the same checkpoint directory. This is not supported.\nFor example, assume streaming query A streams data from Delta table A, and uses the directory\n/checkpoint/A\nas a checkpoint.\nIf streaming query B streams data from Delta table B, but attempts to use the directory\n/checkpoint/A\nas a checkpoint, the\nreservoirId\nof the Delta tables doesn’t match and the query fails with an exception.\nAWS\nInfo\nA similar issue can occur with S3-SQS if you attempt to share the checkpoint directory. This is because S3-SQS uses an internal Delta table to maintain the event messages.\nAzure\nInfo\nA similar issue can occur with ABS-AQS if you attempt to share the checkpoint directory. This is because ABS-AQS uses an internal Delta table to maintain the event messages.\nSolution\nYou should not share checkpoint directories between different streaming queries.\nUse a new checkpoint directory for every new streaming query." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-cache-autoscaling.json b/scraped_kb_articles/delta-cache-autoscaling.json new file mode 100644 index 0000000000000000000000000000000000000000..dd07fe692d3e75d3e23c738e0d7c7efba6cc693a --- /dev/null +++ b/scraped_kb_articles/delta-cache-autoscaling.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-cache-autoscaling", + "title": "Unknown Article Title", + "content": "This article is about how Delta cache (\nAWS\n|\nAzure\n|\nGCP\n) behaves on an auto-scaling cluster, which removes or adds nodes as needed.\nWhen a cluster downscales and terminates nodes:\nA Delta cache behaves in the same way as an RDD cache. Whenever a node goes down, all of the cached data in that particular node is lost. 
Delta cache data is not moved from the lost node.\nWhen a cluster upscales and adds new nodes:\nWhenever a cluster adds a new node, data is not moved between caches. Lost data is re-cached the next time an application accesses the data or tables again." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-external-table-metadata-does-not-match-the-catalog-explorer-view.json b/scraped_kb_articles/delta-external-table-metadata-does-not-match-the-catalog-explorer-view.json new file mode 100644 index 0000000000000000000000000000000000000000..5c6f4d1547405b9260edf1973bab0b24ebbf5195 --- /dev/null +++ b/scraped_kb_articles/delta-external-table-metadata-does-not-match-the-catalog-explorer-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/delta-external-table-metadata-does-not-match-the-catalog-explorer-view", + "title": "Unknown Article Title", + "content": "Problem\nYou notice the metadata for a Delta external table is not synchronizing with the\nCatalog explorer\nview. 
Certain columns are visible in the table but not in the\nCatalog explorer\nview.\nCause\nThe external table was updated outside of Databricks.\nSolution\nRun the following command, which reads from the Delta log of the table and updates the metadata in the Unity Catalog service.\nMSCK REPAIR TABLE SYNC METADATA" +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-live-tables-dlt-cluster-keeps-running-after-pipeline-completes.json b/scraped_kb_articles/delta-live-tables-dlt-cluster-keeps-running-after-pipeline-completes.json new file mode 100644 index 0000000000000000000000000000000000000000..c396559f01738781aa5f573ecd718e4d0d471df8 --- /dev/null +++ b/scraped_kb_articles/delta-live-tables-dlt-cluster-keeps-running-after-pipeline-completes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/delta-live-tables-dlt-cluster-keeps-running-after-pipeline-completes", + "title": "Unknown Article Title", + "content": "Problem\nYour Delta Live Tables (DLT) cluster remains active for an extended period, even after the associated DLT pipeline has completed its execution.\nCause\nThe DLT pipeline is in development mode. Development mode configuration keeps a cluster alive for two hours after an update completes to facilitate quick iteration and testing.\nThis behavior is expected and is part of the default settings for development mode in DLT pipelines.\nSolution\nSet the pipeline to\nProduction\nmode, which does not keep the cluster alive after the update terminates.\nClick\nDelta Live Tables\nin the left-hand menu.\nSelect the pipeline to change.\nClick the\nDevelopment/Production\ntoggle on the right side of the UI to switch to\nProduction\n.\nIf you continue to use development mode, set the\npipelines.clusterShutdown.delay\nparameter to your preferred duration. 
This allows for customization of the cluster's shutdown behavior based on specific needs.\nFor detailed instructions on changing DLT pipeline settings, refer to the\nConfigure pipeline settings for Delta Live Tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-live-tables-dlt-pipeline-fails-with-sparkoutofmemoryerror.json b/scraped_kb_articles/delta-live-tables-dlt-pipeline-fails-with-sparkoutofmemoryerror.json new file mode 100644 index 0000000000000000000000000000000000000000..9ebe09c1aaa387a5f142c1ea63236b6c19a284e8 --- /dev/null +++ b/scraped_kb_articles/delta-live-tables-dlt-pipeline-fails-with-sparkoutofmemoryerror.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/delta-live-tables-dlt-pipeline-fails-with-sparkoutofmemoryerror", + "title": "Unknown Article Title", + "content": "Problem\nYour Delta Live Tables (DLT) pipeline fails with a\n`SparkOutOfMemoryError`\n. You read in the error message that the pipeline is unable to acquire enough memory.\nError Message:\r\norg.apache.spark.memory.SparkOutOfMemoryError:[UNABLE_TO_ACQUIRE_MEMORY] Unable to acquire 104857600 bytes of memory, got 38125049.SQLSTATE: 53200\nCause\nThe Apache Spark components involved, including the Photon engine, do not have sufficient resources to acquire memory.\nSeveral factors contribute to insufficient resources.\nThe DLT default configurations for memory allocation to your worker instance type may not be sufficient for the pipeline’s requirements.\nThe cluster mode and instance type may not be optimized for the pipeline's workload.\nThe pipeline may be running on a Preview channel, which is not recommended for production workloads.\nSolution\nDepending on your situation, either upgrade your worker instance type or switch your channel from Preview to Current.\nUpgrade worker instance type\nChange your worker instance type to one with higher memory capacity than the default. 
You can check the current instance types assigned in DLT by navigating to\nPipeline settings > Advanced > Instance Types\n.\nFor more information, refer to the “Compute configuration options” section of the\nConfigure a DLT pipeline\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation and the “Select instance types to run a pipeline” section of the\nConfigure compute for a DLT pipeline\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSwitch the channel to Current\nTo switch the pipeline from the Preview channel to the Current channel, in the UI navigate to\nPipeline settings > Advanced > Channel\nand choose\n‘Current’\nfrom the dropdown menu. Save the pipeline settings to update." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-live-tables-job-fails-when-using-collect.json b/scraped_kb_articles/delta-live-tables-job-fails-when-using-collect.json new file mode 100644 index 0000000000000000000000000000000000000000..956464d31f4f990a476b41e53106dab84e8f5376 --- /dev/null +++ b/scraped_kb_articles/delta-live-tables-job-fails-when-using-collect.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/delta-live-tables-job-fails-when-using-collect", + "title": "Unknown Article Title", + "content": "Problem\nYou are using\ncollect()\nin your Delta Live Tables (DLT) pipeline code and you get an error. When you review the stack trace, you see a\nDataFrame.collect\nerror that says the function is going to be deprecated soon.\n\"message\": \"Notebook:/path/to/your/notebook used `DataFrame.collect` function that will be deprecated soon. Please fix the notebook.\",\r\n\"details\": {\r\n    \"unsupported_operation\": {\r\n        \"operation\": \"COLLECT_TO_DRIVER\"\r\n        }\r\n    },\r\n\"event_type\": \"unsupported_operation\"\nCause\nWhen using Delta Live Tables, the Python\ntable\nand\nview\nfunctions must return a DataFrame. 
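For scale, the byte counts quoted in the SparkOutOfMemoryError message above work out to roughly a 100 MiB request against a ~36 MiB grant:

```python
# Convert the byte counts quoted in the SparkOutOfMemoryError to MiB to see
# the shortfall: ~100 MiB requested, ~36 MiB granted.
requested = 104857600
granted = 38125049
print(requested / 1024**2)          # 100.0
print(round(granted / 1024**2, 1))  # 36.4
```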
You should not use functions such as\ncollect()\n,\ncount()\n,\ntoPandas()\n,\nsave()\n, and\nsaveAsTable()\n. These do not return DataFrames and should not be used within the\ntable\nand\nview\nfunction definitions.\nPlease review the Delta Live Tables Python limitations (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\nSolution\nYou should not use these functions in your table and view definitions." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-live-tables-pipelines-are-not-running-vacuum-automatically.json b/scraped_kb_articles/delta-live-tables-pipelines-are-not-running-vacuum-automatically.json new file mode 100644 index 0000000000000000000000000000000000000000..8e9a7a47066c811ff419ccb6f423bb2e51b7addc --- /dev/null +++ b/scraped_kb_articles/delta-live-tables-pipelines-are-not-running-vacuum-automatically.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-live-tables-pipelines-are-not-running-vacuum-automatically", + "title": "Unknown Article Title", + "content": "Problem\nDelta Live Tables supports auto-vacuum by default. You set up a Delta Live Tables pipeline, but notice\nVACUUM\nis not running automatically.\nCause\nA Delta Live Tables pipeline needs a separate maintenance\ncluster configuration\n(\nAWS\n|\nAzure\n|\nGCP\n) inside the pipeline settings to ensure\nVACUUM\nruns automatically. If the maintenance cluster is not specified within the pipeline JSON file or if the maintenance cluster does not have access to your storage location, then\nVACUUM\ndoes not run.\nExample configuration\nIn this example Delta Live Tables pipeline JSON file, there is a\ndefault\nlabel which identifies the configuration for the default cluster. 
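As a rough pre-flight check in the spirit of this article, you could scan notebook source for the disallowed calls before deploying a pipeline. The helper below is hypothetical, not a DLT API; DLT performs its own check at run time:

```python
# Hypothetical lint sketch (not a DLT API): flag the driver-collecting calls
# the article says must not appear inside DLT table/view definitions.
DISALLOWED = ("collect(", "count(", "toPandas(", "save(", "saveAsTable(")

def flag_disallowed(source: str) -> list:
    # Deduplicate and sort so the report is stable.
    return sorted({name.rstrip("(") for name in DISALLOWED if name in source})

snippet = "def my_table():\n    df = spark.read.table('raw')\n    return df.collect()"
print(flag_disallowed(snippet))  # ['collect']
```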
This should also contain a\nmaintenance\nlabel that identifies the configuration for the maintenance cluster.\nSince the maintenance cluster configuration is not present,\nVACUUM\ndoes not automatically run.\nAWS\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"c5.4xlarge\",\r\n      \"driver_node_type_id\": \"c5.4xlarge\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        \"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      },\r\n      \"aws_attributes\": {\r\n        \"instance_profile_arn\": \"arn:aws:...\"\r\n      }\r\n    }\r\n ]\r\n}\nAzure\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"Standard_D3_v2\",\r\n      \"driver_node_type_id\": \"Standard_D3_v2\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        \"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      }\r\n    }\r\n  ]\r\n}\nGCP\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"n1-standard-4\",\r\n      \"driver_node_type_id\": \"n1-standard-4\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        \"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      }\r\n    }\r\n  ]\r\n}\nSolution\nConfigure a maintenance cluster in the Delta Live Tables pipeline JSON file.\nYou have to specify configurations for two different cluster types:\nA default cluster where all processing is performed.\nA maintenance cluster where daily maintenance tasks are run.\nEach cluster is identified using the label field.\nThe maintenance cluster is responsible for performing\nVACUUM\nand other maintenance tasks.\nAWS\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"\",\r\n      \"driver_node_type_id\": \"\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        
\"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      },\r\n      \"aws_attributes\": {\r\n        \"instance_profile_arn\": \"arn:aws:...\"\r\n      }\r\n    },\r\n    {\r\n      \"label\": \"maintenance\",\r\n      \"aws_attributes\": {\r\n        \"instance_profile_arn\": \"arn:aws:...\"\r\n      }\r\n    }\r\n  ]\r\n}\nInfo\nIf the maintenance cluster requires access to storage via an instance profile, you need to specify it with\ninstance_profile_arn\n.\nAzure\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"Standard_D3_v2\",\r\n      \"driver_node_type_id\": \"Standard_D3_v2\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        \"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      }\r\n    },\r\n    {\r\n      \"label\": \"maintenance\"\r\n    }\r\n  ]\r\n}\nInfo\nIf you need to use Azure Data Lake Storage credential passthrough, or another configuration to access your storage location, specify it for both the default cluster and the maintenance cluster.\nGCP\n{\r\n  \"clusters\": [\r\n    {\r\n      \"label\": \"default\",\r\n      \"node_type_id\": \"n1-standard-4\",\r\n      \"driver_node_type_id\": \"n1-standard-4\",\r\n      \"num_workers\": 20,\r\n      \"spark_conf\": {\r\n        \"spark.databricks.io.parquet.nativeReader.enabled\": \"false\"\r\n      }\r\n    },\r\n    {\r\n      \"label\": \"maintenance\"\r\n    }\r\n  ]\r\n}\nInfo\nWhen using cluster policies to configure Delta Live Tables clusters, you should apply a single policy to both the default and maintenance clusters." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-merge-cannot-resolve-field.json b/scraped_kb_articles/delta-merge-cannot-resolve-field.json new file mode 100644 index 0000000000000000000000000000000000000000..6d77f9ba06d9c526b889b099a22b1933b6eb314e --- /dev/null +++ 
b/scraped_kb_articles/delta-merge-cannot-resolve-field.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-merge-cannot-resolve-field", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting a Delta Merge with automatic schema evolution, but it fails with a\nDelta Merge: cannot resolve 'field' due to data type mismatch\nerror message.\nCause\nThis can happen if you have made changes to the nested column fields.\nFor example, assume we have a column called\nAddress\nwith the fields\nstreetName\n,\nhouseNumber\n, and\ncity\nnested inside.\nAttempting to add an additional field, or remove a field, causes any upcoming insert or update transaction on the table to fail, even if\nmergeSchema\nis true for the transaction.\nSolution\nThis behavior is by design.\nThe Delta automatic schema evolution feature only supports top-level columns. Nested fields are not supported.\nPlease review the Delta Lake Automatic schema evolution (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-merge-into.json b/scraped_kb_articles/delta-merge-into.json new file mode 100644 index 0000000000000000000000000000000000000000..65d5fb6d240deadedabc865d9f9294852835f5d0 --- /dev/null +++ b/scraped_kb_articles/delta-merge-into.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-merge-into", + "title": "Unknown Article Title", + "content": "This article explains how to trigger partition pruning in Delta Lake\nMERGE INTO\n(\nAWS\n|\nAzure\n|\nGCP\n) queries from Databricks.\nPartition pruning is an optimization technique to limit the number of partitions that are inspected by a query.\nDiscussion\nMERGE INTO\ncan be computationally expensive if done inefficiently. You should partition the underlying data before using\nMERGE INTO\n. 
If you do not, query performance can be negatively impacted.\nThe main lesson is this: if you know which partitions a\nMERGE INTO\nquery needs to inspect, you should specify them in the query so that partition pruning is performed.\nDemonstration: no partition pruning\nHere is an example of a poorly performing\nMERGE INTO\nquery without partition pruning.\nStart by creating the following Delta table, called\ndelta_merge_into\n:\n%scala\nimport org.apache.spark.sql.functions.{current_timestamp, rand}\nimport org.apache.spark.sql.types.IntegerType\n\nspark.range(30000000)\n.withColumn(\"par\", ($\"id\" % 1000).cast(IntegerType))\n.withColumn(\"ts\", current_timestamp())\n.write\n.format(\"delta\")\n.mode(\"overwrite\")\n.partitionBy(\"par\")\n.saveAsTable(\"delta_merge_into\")\nThen create a DataFrame of update rows and register it as a temporary view called\nupdate\n:\n%scala\nval updatesTableName = \"update\"\nval targetTableName = \"delta_merge_into\"\nval updates = spark.range(100).withColumn(\"id\", (rand() * 30000000 * 2).cast(IntegerType))\n.withColumn(\"par\", ($\"id\" % 2).cast(IntegerType))\n.withColumn(\"ts\", current_timestamp())\n.dropDuplicates(\"id\")\nupdates.createOrReplaceTempView(updatesTableName)\nThe\nupdate\nview has 100 rows with three columns,\nid\n,\npar\n, and\nts\n. The value of\npar\nis always either 1 or 0.\nLet’s say you run the following simple\nMERGE INTO\nquery:\n%scala\nspark.sql(s\"\"\"\n|MERGE INTO $targetTableName\n|USING $updatesTableName\n|ON $targetTableName.id = $updatesTableName.id\n|WHEN MATCHED THEN\n|  UPDATE SET $targetTableName.ts = $updatesTableName.ts\n|WHEN NOT MATCHED THEN\n|  INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)\n\"\"\".stripMargin)\nThe query takes 13.16 minutes to complete.\nThe physical plan for this query contains\nPartitionCount: 1000\n, as shown below. This means Apache Spark is scanning all 1000 partitions in order to execute the query.
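Why scanning all 1000 partitions is wasteful can be sketched without Spark. This is an illustrative plain-Python model (the helper name `partitions_touched` is mine, not a Spark API): because every update row carries `par = id % 2`, the merge can only ever touch two of the target's 1000 partitions.

```python
import random

TARGET_PARTITIONS = 1000  # delta_merge_into is partitioned by par = id % 1000

def partitions_touched(update_ids):
    # The updates DataFrame assigns par = id % 2, so its rows can only land
    # in (or match within) partitions 0 and 1.
    return {i % 2 for i in update_ids}

# Mimic the article's update generation: 100 random ids in [0, 60M).
update_ids = [int(random.random() * 30000000 * 2) for _ in range(100)]

pruned = partitions_touched(update_ids)
print(f"prune to {sorted(pruned)} instead of scanning {TARGET_PARTITIONS} partitions")
```

This is exactly the knowledge the rewritten query below encodes with `par IN (1,0)`.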
This is not an efficient query, because the\nupdate\ndata only has partition values of\n1\nand\n0\n:\n== Physical Plan ==\n*(5) HashAggregate(keys=[], functions=[finalmerge_count(merge count#8452L) AS count(1)#8448L], output=[count#8449L])\n+- Exchange SinglePartition\n+- *(4) HashAggregate(keys=[], functions=[partial_count(1) AS count#8452L], output=[count#8452L])\n+- *(4) Project\n+- *(4) Filter (isnotnull(count#8440L) && (count#8440L > 1))\n+- *(4) HashAggregate(keys=[_row_id_#8399L], functions=[finalmerge_sum(merge sum#8454L) AS sum(cast(one#8434 as bigint))#8439L], output=[count#8440L])\n+- Exchange hashpartitioning(_row_id_#8399L, 200)\n+- *(3) HashAggregate(keys=[_row_id_#8399L], functions=[partial_sum(cast(one#8434 as bigint)) AS sum#8454L], output=[_row_id_#8399L, sum#8454L])\n+- *(3) Project [_row_id_#8399L, UDF(_file_name_#8404) AS one#8434]\n+- *(3) BroadcastHashJoin [cast(id#7514 as bigint)], [id#8390L], Inner, BuildLeft, false\n:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))\n:  +- *(2) HashAggregate(keys=[id#7514], functions=[], output=[id#7514])\n:     +- Exchange hashpartitioning(id#7514, 200)\n:        +- *(1) HashAggregate(keys=[id#7514], functions=[], output=[id#7514])\n:           +- *(1) Filter isnotnull(id#7514)\n:              +- *(1) Project [cast(((rand(8188829649009385616) * 3.0E7) * 2.0) as int) AS id#7514]\n:                 +- *(1) Range (0, 100, step=1, splits=36)\n+- *(3) Filter isnotnull(id#8390L)\n+- *(3) Project [id#8390L, _row_id_#8399L, input_file_name() AS _file_name_#8404]\n+- *(3) Project [id#8390L, monotonically_increasing_id() AS _row_id_#8399L]\n+- *(3) Project [id#8390L, par#8391, ts#8392]\n+- *(3) FileScan parquet [id#8390L,ts#8392,par#8391] Batched: true, DataFilters: [], Format: Parquet, Location: TahoeBatchFileIndex[dbfs:/user/hive/warehouse/delta_merge_into], PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct\nSolution\nRewrite the query to 
specify the partitions.\nThis\nMERGE INTO\nquery specifies the partitions directly:\n%scala\nspark.sql(s\"\"\"\n|MERGE INTO $targetTableName\n|USING $updatesTableName\n|ON $targetTableName.par IN (1,0) AND $targetTableName.id = $updatesTableName.id\n|WHEN MATCHED THEN\n|  UPDATE SET $targetTableName.ts = $updatesTableName.ts\n|WHEN NOT MATCHED THEN\n|  INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)\n\"\"\".stripMargin)\nNow the query takes just 20.54 seconds to complete on the same cluster:\nThe physical plan for this query contains\nPartitionCount: 2\n, as shown below. With only minor changes, the query is now more than 40X faster:\n== Physical Plan ==\n*(5) HashAggregate(keys=[], functions=[finalmerge_count(merge count#7892L) AS count(1)#7888L], output=[count#7889L])\n+- Exchange SinglePartition\n+- *(4) HashAggregate(keys=[], functions=[partial_count(1) AS count#7892L], output=[count#7892L])\n+- *(4) Project\n+- *(4) Filter (isnotnull(count#7880L) && (count#7880L > 1))\n+- *(4) HashAggregate(keys=[_row_id_#7839L], functions=[finalmerge_sum(merge sum#7894L) AS sum(cast(one#7874 as bigint))#7879L], output=[count#7880L])\n+- Exchange hashpartitioning(_row_id_#7839L, 200)\n+- *(3) HashAggregate(keys=[_row_id_#7839L], functions=[partial_sum(cast(one#7874 as bigint)) AS sum#7894L], output=[_row_id_#7839L, sum#7894L])\n+- *(3) Project [_row_id_#7839L, UDF(_file_name_#7844) AS one#7874]\n+- *(3) BroadcastHashJoin [cast(id#7514 as bigint)], [id#7830L], Inner, BuildLeft, false\n:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))\n:  +- *(2) HashAggregate(keys=[id#7514], functions=[], output=[id#7514])\n:     +- Exchange hashpartitioning(id#7514, 200)\n:        +- *(1) HashAggregate(keys=[id#7514], functions=[], output=[id#7514])\n:           +- *(1) Filter isnotnull(id#7514)\n:              +- *(1) Project [cast(((rand(8188829649009385616) * 3.0E7) * 2.0) as int) AS id#7514]\n:        
         +- *(1) Range (0, 100, step=1, splits=36)\n+- *(3) Project [id#7830L, _row_id_#7839L, _file_name_#7844]\n+- *(3) Filter (par#7831 IN (1,0) && isnotnull(id#7830L))\n+- *(3) Project [id#7830L, par#7831, _row_id_#7839L, input_file_name() AS _file_name_#7844]\n+- *(3) Project [id#7830L, par#7831, monotonically_increasing_id() AS _row_id_#7839L]\n+- *(3) Project [id#7830L, par#7831, ts#7832]\n+- *(3) FileScan parquet [id#7830L,ts#7832,par#7831] Batched: true, DataFilters: [], Format: Parquet, Location: TahoeBatchFileIndex[dbfs:/user/hive/warehouse/delta_merge_into], PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct" +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-share-not-visible-in-shared-with-me-section.json b/scraped_kb_articles/delta-share-not-visible-in-shared-with-me-section.json new file mode 100644 index 0000000000000000000000000000000000000000..d500e18c59fbec79f72fe306a315e5ca281352b8 --- /dev/null +++ b/scraped_kb_articles/delta-share-not-visible-in-shared-with-me-section.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/delta-share-not-visible-in-shared-with-me-section", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen logging in as a Delta Sharing data recipient, you expect to see the share tables under the\nShared with me\nsection. Instead you see the following message.\nLooking for data shared with you? 
Contact your account administrator to grant you the USE_PROVIDER privilege to view all inbound shares.\nCause\nTo view tables from a Delta Sharing data provider, a metastore admin must grant you the\nUSE PROVIDER\nprivilege, or you must be a metastore admin.\nSolution\nHave the metastore admin use SQL in a notebook or the Databricks UI to grant the\nUSE_PROVIDER\nprivilege.\nUse SQL in a notebook\nRun the following SQL command in a notebook.\n%sql\r\nGRANT USE_PROVIDER ON METASTORE TO ``\nUse the Databricks UI\nIn your Recipient Databricks workspace:\nClick the\nCatalog\ntab.\nClick the gear icon and select\nMetastore\n.\nSelect the\nPermissions\ntab.\nClick the\nGrant\nbutton. Issue the\nUSE_PROVIDER\nprivilege to the desired user or group.\nFor more information, review the\nManage Delta Sharing providers (for data recipients)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-sharing-shared-table-appearing-empty-in-the-target-databricks-workspace.json b/scraped_kb_articles/delta-sharing-shared-table-appearing-empty-in-the-target-databricks-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..a0e878b0f443896717b864708c643b3d29d8bf85 --- /dev/null +++ b/scraped_kb_articles/delta-sharing-shared-table-appearing-empty-in-the-target-databricks-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-sharing-shared-table-appearing-empty-in-the-target-databricks-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using Delta Sharing to share partitioned data from partitioned tables between Databricks workspaces. You notice the shared table in the target workspace appears empty except you can see column names. 
You further notice that all the data is visible after you remove the partition filter from the share.\nYou have already verified the table in the source workspace contains data, and the partitions are added to the share.\nCause\nThe partition column in the Delta Sharing partition filter is of\ndecimal(38,4)\ndata type, which includes trailing zeros.\nWhen you add a partition filter with a value such as\n“1234”\n, it does not match the actual data representation in the source workspace with the trailing zeros, which is\n“1234.0000”\n.\nThe mismatch causes data not to be returned.\nSolution\n1. Verify the data type and representation of the partition column in the source workspace.\n2. Upon adding a partition filter to the share, use the correct representation of the value including any trailing zeros (\n`1234.0000`\ninstead of\n`1234`\n).\n3. Use SQL to update the share to use the correct partition value with the correct representation.\nALTER SHARE ADD TABLE PARTITION ( = '1234.0000');" +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-table-as-a-streaming-source-returns-error-delta_file_not_found_detailed-even-though-no-user-or-lifecycle-rule-has-deleted-files.json b/scraped_kb_articles/delta-table-as-a-streaming-source-returns-error-delta_file_not_found_detailed-even-though-no-user-or-lifecycle-rule-has-deleted-files.json new file mode 100644 index 0000000000000000000000000000000000000000..e1210d00abf40d16ae52861ee1018f1cbe8ff64e --- /dev/null +++ b/scraped_kb_articles/delta-table-as-a-streaming-source-returns-error-delta_file_not_found_detailed-even-though-no-user-or-lifecycle-rule-has-deleted-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-table-as-a-streaming-source-returns-error-delta_file_not_found_detailed-even-though-no-user-or-lifecycle-rule-has-deleted-files", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Delta table stream fails to consume and you receive the following 
error message.\norg.apache.spark.SparkException: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file abfss: .dfs.core.windows.net//part-xxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet. [DELTA_FILE_NOT_FOUND_DETAILED] File abfss://.dfs.core.windows.net//part-xxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet referenced in the transaction log cannot be found.\nCause\nIf the file was not manually deleted, the issue may stem from the Delta table’s last successful consumption time relative to the\nVACUUM\noperation time.\nThe\nVACUUM\noperation is triggered daily and removes files older than seven days. Files marked for deletion on the Delta transaction logs are deleted. When this happens, the latest offset file under the checkpoint may still reference the deleted files, resulting in\nDBR_FILE_NOT_EXIST\nerrors.\nThis is most common with Delta tables that include non-append operations such as\nOPTIMIZE\ncombined with periodic\nVACUUM\ncommands.\nSolution\nThere are three options available. If you are able, combining option two and three is best.\n1. Consume the source table from scratch, essentially streaming from the Delta table as a new consumer to a new location with a new checkpoint.\nNote\nThis method can sometimes create challenges for downstream consumers. For instance, if the consumer writes events to Kafka, it could result in duplicate events.\n2. Force the consumer to skip missing versions by starting fresh and consuming from a specific version. (Do not delete the metadata files for missing versions. 
This results in a different error.)\nUse the following code to set a new checkpoint location and read from the earliest available version of the data.\nstreaming_df = spark.readStream \\\r\n    .format(\"delta\") \\\r\n    .option(\"startingVersion\", \"20\") \\\r\n    .load(delta_table_path)\r\n\r\nquery = streaming_df.writeStream \\\r\n    .outputMode(\"append\") \\\r\n    .format(\"delta\") \\\r\n    .option(\"checkpointLocation\", \"/\") \\\r\n    .start(output_delta_table_path)\n3. If your data includes a timestamp field, load the data from scratch using a filter based on a specific timestamp. Determine the timestamp when the consumer last successfully processed records. Then, restart consumption from scratch, applying a filter to process only records with a timestamp greater than or equal to that value.\nstreaming_df = spark.readStream \\\r\n    .format(\"delta\") \\\r\n    .load(delta_table_path) \\\r\n    .filter(\"timestamp_field >= ''\")  # replace '' with the last successfully processed timestamp\r\n\r\nquery = streaming_df.writeStream \\\r\n    .outputMode(\"append\") \\\r\n    .format(\"delta\") \\\r\n    .option(\"checkpointLocation\", \"/\") \\\r\n    .start(output_delta_table_path)" +} \ No newline at end of file diff --git a/scraped_kb_articles/delta-writing-empty-files-when-source-is-empty.json b/scraped_kb_articles/delta-writing-empty-files-when-source-is-empty.json new file mode 100644 index 0000000000000000000000000000000000000000..71e11981ca4c1a73310d91be63d6879506201278 --- /dev/null +++ b/scraped_kb_articles/delta-writing-empty-files-when-source-is-empty.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta-writing-empty-files-when-source-is-empty", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDelta writes can result in the creation of empty files if the source is empty.
This can happen with a regular Delta write or a\nMERGE INTO\n(\nAWS\n|\nAzure\n|\nGCP\n) operation.\nIf your streaming application writes to a target Delta table and your source data is empty on certain micro-batches, empty files can be written to your target Delta table.\nWriting empty files to a Delta table should be avoided because they cause performance issues (for example, too many small files and unnecessary commits). If many commits happen at a high frequency (due to a very large inflow of high-frequency events, a low streaming trigger interval, or both), the target Delta table can accumulate many small files. Too many small empty files increase the overall listing time, which hampers subsequent read performance.\nCause\nThe writing of empty files is a known issue in Databricks Runtime 7.3 LTS. Empty writes create additional files as well as new versions in Delta.\nIf there are 1000 empty writes in a day, you see 1000 empty files created, which accumulate over time. Even a table with just three records can result in several thousand empty files, depending on how frequently writes are performed.\nFor example, in this sample Delta commit,\nnumOutputRows\nis\n0\n, however\nnumTargetFilesAdded\nis\n1\n.
This means it has added one file, even though there are no output rows.\nOperation - Write\r\n {\"numFiles\":\"1\",\"numOutputBytes\":\"2675\",\"numOutputRows\":\"0\"} OperationParameters{\"mode\":\"Append\",\"partitionBy\":\"[]\"}\r\n\r\nOperation - Merge\r\n{\"numOutputRows\":\"0\",\"numSourceRows\":\"0\",\"numTargetFilesAdded\":\"1\",\"numTargetFilesRemoved\":\"0\",\"numTargetRowsCopied\":\"0\",\"numTargetRowsDeleted\":\"0\",\"numTargetRowsInserted\":\"0\",\"numTargetRowsUpdated\":\"0\"}\nSolution\nYou should upgrade your clusters to Databricks Runtime 9.1 LTS or above.\nDatabricks Runtime 9.1 LTS and above contains a fix for the issue and no longer creates empty files for empty writes.\nInfo\nIf you cannot upgrade to Databricks Runtime 9.1 LTS or above, you should periodically run\nOPTIMIZE\n(\nAWS\n|\nAzure\n|\nGCP\n) on the affected table to clean up the empty files. This is not a permanent fix and should be considered a workaround until you can upgrade to a newer runtime."
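The commit metrics shown in this article can be inspected programmatically. A minimal sketch (the helper name `is_empty_write` is mine, not a Delta API; the metric keys are taken from the sample Write and Merge commits above) that flags commits which added files without emitting any rows:

```python
import json

# Operation metrics shaped like the sample commits in this article; an
# "empty write" adds a file (and a new table version) with zero output rows.
write_metrics = json.loads('{"numFiles":"1","numOutputBytes":"2675","numOutputRows":"0"}')
merge_metrics = json.loads(
    '{"numOutputRows":"0","numSourceRows":"0","numTargetFilesAdded":"1",'
    '"numTargetFilesRemoved":"0","numTargetRowsCopied":"0","numTargetRowsDeleted":"0",'
    '"numTargetRowsInserted":"0","numTargetRowsUpdated":"0"}')

def is_empty_write(metrics):
    """True when a commit added files without producing any output rows."""
    rows = int(metrics.get("numOutputRows", "0"))
    files_added = int(metrics.get("numFiles", metrics.get("numTargetFilesAdded", "0")))
    return rows == 0 and files_added > 0

print(is_empty_write(write_metrics), is_empty_write(merge_metrics))
```

Scanning the table history for such commits is one way to gauge how many empty files have accumulated before running `OPTIMIZE`.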
+} \ No newline at end of file diff --git a/scraped_kb_articles/delta_clustering_column_missing_stats-error-when-attempting-to-define-liquid-clustering-for-a-delta-table.json b/scraped_kb_articles/delta_clustering_column_missing_stats-error-when-attempting-to-define-liquid-clustering-for-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..a3a80f7fdcba1a1d437609139af037f1a77184df --- /dev/null +++ b/scraped_kb_articles/delta_clustering_column_missing_stats-error-when-attempting-to-define-liquid-clustering-for-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/delta_clustering_column_missing_stats-error-when-attempting-to-define-liquid-clustering-for-a-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re attempting to run an SQL command to enable liquid clustering on your existing Delta table.\nALTER TABLE CLUSTER BY ()\nThe command results in an error.\n[DELTA_CLUSTERING_COLUMN_MISSING_STATS] Liquid clustering requires clustering columns to have stats. Couldn't find clustering column(s) \nCause\nThe table is missing Delta statistics for the specific set of columns\n\nused in the\nCLUSTER BY\nclause.\nSolution\nTo generate Delta statistics, run the following\nANALYZE\ncommand. This command generates Delta statistics on the first 32 columns defined in your table schema.\nANALYZE TABLE COMPUTE DELTA STATISTICS\nTo select specific columns (\ncolumn1,column2,column3\n) for Delta statistics generation, and to reduce the\nANALYZE\nexecution time, set the following configuration before running\nANALYZE\n. 
This set can also be the same as the clustering keys.\nALTER TABLE \r\nSET TBLPROPERTIES (\r\n  'delta.dataSkippingStatsColumns' = 'column1,column2,column3'\r\n);" +} \ No newline at end of file diff --git a/scraped_kb_articles/deltafilenotfoundexception-when-reading-a-table.json b/scraped_kb_articles/deltafilenotfoundexception-when-reading-a-table.json new file mode 100644 index 0000000000000000000000000000000000000000..0d8ef70a68361bd043883dbc854841042d59cf49 --- /dev/null +++ b/scraped_kb_articles/deltafilenotfoundexception-when-reading-a-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/deltafilenotfoundexception-when-reading-a-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to read a Delta table when you get a\nDeltaFileNotFoundException\nerror.\ncom.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: dbfs:/VolumesOrMount/path/to/deltaTable/_delta_log/00000000000000000000.json: Unable to reconstruct state at version {version} as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)\nCause\nDeltaFileNotFoundException\noccurs when Delta Lake cannot access the transaction log files or checkpoint files required to reconstruct the state of the table at a specified version.\nThis can be caused by manual deletion of log files, retention policies, or storage lifecycle policies.\nManual deletion of log files\nYou have manually deleted files in the\n_delta_log\ndirectory, either intentionally or unintentionally, resulting in an incomplete transaction log history.\nRetention policies\nDelta Lake enforces retention policies to manage the size of the transaction logs and checkpoint files:\nCheckpoint Retention (\ndelta.checkpointRetentionDuration\n): Default is 2 days.\nTransaction Log Retention (\ndelta.logRetentionDuration\n): Default is 30 
days.\nIf the requested version exceeds the retention period, the files may no longer exist, which results in a\nDeltaFileNotFoundException\nerror.\nStorage lifecycle policies\nYour underlying cloud storage service may have lifecycle policies that delete or archive files before the Delta retention policies expire. As a result, the requested files no longer exist.\nSolution\nTo resolve this issue, you can either recreate the table or attempt to recover the missing files.\nInfo\nIf you have\nS3 Versioning enabled\non AWS,\nSoft Delete enabled\non Azure,\nSoft Delete enabled\non GCP, or a similar backup mechanism that periodically saves a copy of the files, you should be able to recover your files.\nTo prevent this issue from occurring, you should take steps to prevent manual deletion of files in the\n_delta_log\ndirectory. The log files are important for maintaining table consistency.\nYou can also increase the retention duration if you regularly require access to older versions or perform historical queries.\nFinally, you should review the storage policies on your underlying cloud storage account to ensure files are not deleted prematurely. Your cloud storage settings should align with your Delta Lake retention policy settings." 
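The retention arithmetic described above can be sketched in a few lines. This is an illustrative helper (the names and constants are mine, mirroring the defaults quoted in this article: 30 days for transaction logs, 2 days for checkpoints) for estimating whether a requested version's files may already have been cleaned up:

```python
from datetime import datetime, timedelta

# Default Delta retention windows cited in this article.
LOG_RETENTION = timedelta(days=30)        # delta.logRetentionDuration
CHECKPOINT_RETENTION = timedelta(days=2)  # delta.checkpointRetentionDuration

def outside_retention(file_modified_at, retention, now=None):
    """True when a log/checkpoint file is old enough to be removed by retention."""
    now = now or datetime.now()
    return now - file_modified_at > retention

now = datetime(2024, 6, 1)
old_log = datetime(2024, 4, 1)     # ~61 days old: may already be gone
fresh_log = datetime(2024, 5, 20)  # 12 days old: still within log retention
print(outside_retention(old_log, LOG_RETENTION, now),
      outside_retention(fresh_log, LOG_RETENTION, now))
```

If historical queries routinely fall outside these windows, increase the table's retention properties rather than relying on recovery from storage backups.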
+} \ No newline at end of file diff --git a/scraped_kb_articles/deltainvariantviolationexception-exceeds-charvarchar-type-length-limitation-error-when-writing-a-delta-table.json b/scraped_kb_articles/deltainvariantviolationexception-exceeds-charvarchar-type-length-limitation-error-when-writing-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..cd633dbc45321948557747469d40f7f757b8d607 --- /dev/null +++ b/scraped_kb_articles/deltainvariantviolationexception-exceeds-charvarchar-type-length-limitation-error-when-writing-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/deltainvariantviolationexception-exceeds-charvarchar-type-length-limitation-error-when-writing-a-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to write a Delta table in Delta Lake, you encounter an error indicating the data exceeds the char/varchar type length limitation.\ncom.databricks.sql.transaction.tahoe.schema.DeltaInvariantViolationException: [DELTA_EXCEED_CHAR_VARCHAR_LIMIT] Exceeds char/varchar type length limitation. Failed check: (expression)\nCause\nThere is a mismatch between the length of the data being processed and the defined length in the schema. In Databricks Runtime versions 11.3 LTS and above, strings are treated as fixed-length character types (\nCHAR\n) with a maximum length of 255 characters.\nWhen defining a schema with\nVARCHAR\nor\nCHAR\ntypes, Databricks Runtime strictly enforces the specified length constraints.\nSolution\n1. Navigate to the Databricks workspace and select the cluster where the notebook is running.\n2. Click the\nEdit\nbutton to modify the cluster configuration.\n3. Scroll down to the\nSpark Config\nsection and add the following configuration.\nspark.sql.legacy.charVarcharAsString true\n4. 
Save the changes and restart the cluster.\nNote\nThe configuration\nspark.sql.legacy.charVarcharAsString\nin Apache Spark is used to control how\nCHAR\nand\nVARCHAR\ntypes are handled.\nWhen this configuration is set to true,\nCHAR\nand\nVARCHAR\ntypes are treated as\nSTRING\ntypes in Spark. This can help avoid issues related to strict length limitations and padding that are typically associated with\nCHAR\nand\nVARCHAR\ntypes." +} \ No newline at end of file diff --git a/scraped_kb_articles/deploying-databricks-asset-bundles-through-a-cicd-pipeline-fails-with-403-error.json b/scraped_kb_articles/deploying-databricks-asset-bundles-through-a-cicd-pipeline-fails-with-403-error.json new file mode 100644 index 0000000000000000000000000000000000000000..858c2313c347e9352e8da664f58b462f50e814bd --- /dev/null +++ b/scraped_kb_articles/deploying-databricks-asset-bundles-through-a-cicd-pipeline-fails-with-403-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/deploying-databricks-asset-bundles-through-a-cicd-pipeline-fails-with-403-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen deploying Databricks Asset Bundles (DABs) through the CICD pipeline, the process fails with a 403 error, despite successful manual deployments from the Linux server.\nError message\n403 Forbidden/\")\nFor further reading and best practices on distinguishing between workspace and DBFS paths, refer to the\nWhat are workspace files?\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nDatabricks Utilities (\ndbutils\n) reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/directory_protected-error-when-deleting-a-users-home-folder.json b/scraped_kb_articles/directory_protected-error-when-deleting-a-users-home-folder.json new file mode 100644 index 0000000000000000000000000000000000000000..1837e09f293e7e976d346048b6e2a709aca8e20d --- /dev/null +++ b/scraped_kb_articles/directory_protected-error-when-deleting-a-users-home-folder.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/directory_protected-error-when-deleting-a-users-home-folder", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/disable-broadcast-when-broadcastnestedloopjoin.json b/scraped_kb_articles/disable-broadcast-when-broadcastnestedloopjoin.json new file mode 100644 index 0000000000000000000000000000000000000000..d6ca07660963bcd3548794c6747c5e620a125792 --- /dev/null +++ b/scraped_kb_articles/disable-broadcast-when-broadcastnestedloopjoin.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/disable-broadcast-when-broadcastnestedloopjoin", + "title": "Título do Artigo Desconhecido", + "content": "This article explains how to disable broadcast when the query plan has\nBroadcastNestedLoopJoin\nin the physical plan.\nYou expect the broadcast to stop after you disable the broadcast threshold, by setting\nspark.sql.autoBroadcastJoinThreshold\nto\n-1\n, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error.\nThis behavior is NOT a bug, however it can be unexpected. 
We are going to review the expected behavior and provide a mitigation option for this issue.\nCreate tables\nStart by creating two tables, one with null values\ntable_withNull\nand the other without null values\ntblA_NoNull\n.\n%sql\r\n\r\nsql(\"SELECT id FROM RANGE(10)\").write.mode(\"overwrite\").saveAsTable(\"tblA_NoNull\")\r\nsql(\"SELECT id FROM RANGE(50) UNION SELECT NULL\").write.mode(\"overwrite\").saveAsTable(\"table_withNull\")\nAttempt to disable broadcast\nWe attempt to disable broadcast by setting\nspark.sql.autoBroadcastJoinThreshold\nfor the query, which has a sub-query with an\nin\nclause.\n%sql\r\n\r\nspark.conf.set(\"spark.sql.autoBroadcastJoinThreshold\", -1)\r\nsql(\"select * from table_withNull where id not in (select id from tblA_NoNull)\").explain(true)\nIf you review the query plan,\nBroadcastNestedLoopJoin\nis the last possible fallback in this situation. It appears even after attempting to disable the broadcast.\n== Physical Plan ==\r\n*(2) BroadcastNestedLoopJoin BuildRight, LeftAnti, ((id#2482L = id#2483L) || isnull((id#2482L = id#2483L)))\r\n:- *(2) FileScan parquet default.table_withnull[id#2482L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/table_withnull], PartitionFilters: [], PushedFilters: [], ReadSchema: struct\r\n+- BroadcastExchange IdentityBroadcastMode, [id=#2586]\r\n   +- *(1) FileScan parquet default.tbla_nonull[id#2483L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/tbla_nonull], PartitionFilters: [], PushedFilters: [], ReadSchema: struct\nIf the data being processed is large enough, this results in broadcast errors when Spark attempts to broadcast the table.\nRewrite query using\nnot exists\ninstead of\nin\nYou can resolve the issue by rewriting the query with\nnot exists\ninstead of\nin\n.\n%sql\r\n\r\n// It can be rewritten into a NOT EXISTS, which will become a regular join:\r\nsql(\"select * from 
table_withNull where not exists (select 1 from tblA_NoNull where table_withNull.id = tblA_NoNull.id)\").explain(true)\nBy using\nnot exists\n, the query runs with\nSortMergeJoin\n.\n== Physical Plan ==\r\nSortMergeJoin [id#2482L], [id#2483L], LeftAnti\r\n:- Sort [id#2482L ASC NULLS FIRST], false, 0\r\n:  +- Exchange hashpartitioning(id#2482L, 200), [id=#2653]\r\n:     +- *(1) FileScan parquet default.table_withnull[id#2482L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/table_withnull], PartitionFilters: [], PushedFilters: [], ReadSchema: struct\r\n+- Sort [id#2483L ASC NULLS FIRST], false, 0\r\n   +- Exchange hashpartitioning(id#2483L, 200), [id=#2656]\r\n      +- *(2) Project [id#2483L]\r\n         +- *(2) Filter isnotnull(id#2483L)\r\n            +- *(2) FileScan parquet default.tbla_nonull[id#2483L] Batched: true, DataFilters: [isnotnull(id#2483L)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/user/hive/warehouse/tbla_nonull], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct\nExplanation\nSpark doesn’t do this automatically, because Spark and SQL have slightly different semantics for null handling.\nIn SQL,\nnot in\nmeans that if there is any null value in the\nnot in\nvalues, the result is empty. This is why it can only be executed with\nBroadcastNestedLoopJoin\n. All\nnot in\nvalues must be known in order to ensure there is no null value in the set.\nExample notebook\nThis notebook has a complete example, showing why Spark does not automatically switch\nBroadcastNestedLoopJoin\nto\nSortMergeJoin\n.\nReview the\nBroadcastNestedLoopJoin\nexample notebook\n." 
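The null semantics in the Explanation can be modeled in plain Python (the `sql_not_in` helper is illustrative three-valued logic, not a Spark API), using the same two tables as the notebook:

```python
# Three-valued NOT IN: None plays the role of SQL NULL.
def sql_not_in(x, values):
    """SQL `x NOT IN values`: False on a match; None (unknown) when x is NULL
    or the list contains NULL without a match; True otherwise."""
    if x is None:
        return None
    if any(v is not None and x == v for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True

table_with_null = list(range(50)) + [None]  # table_withNull: RANGE(50) UNION NULL
tbl_no_null = list(range(10))               # tblA_NoNull: RANGE(10)

# `WHERE id NOT IN (SELECT id FROM tblA_NoNull)` keeps only rows that are True:
survivors = [x for x in table_with_null if sql_not_in(x, tbl_no_null) is True]
print(len(survivors))  # ids 10..49 survive

# With any NULL on the NOT IN side, no row can ever evaluate to True:
print(len([x for x in table_with_null if sql_not_in(x, table_with_null) is True]))
```

Because one NULL on the subquery side empties the whole result, Spark must see every value of the subquery, which is why it falls back to `BroadcastNestedLoopJoin` while `NOT EXISTS` can use a regular join.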
+} \ No newline at end of file diff --git a/scraped_kb_articles/disable-cluster-scoped-init-scripts-on-dbfs.json b/scraped_kb_articles/disable-cluster-scoped-init-scripts-on-dbfs.json new file mode 100644 index 0000000000000000000000000000000000000000..6ab869ebc56b23bedaa5c8778625957053073acb --- /dev/null +++ b/scraped_kb_articles/disable-cluster-scoped-init-scripts-on-dbfs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/disable-cluster-scoped-init-scripts-on-dbfs", + "title": "Título do Artigo Desconhecido", + "content": "On May 2, 2023 Databricks announced that cluster-scoped init scripts stored on DBFS are deprecated. Cluster-scoped init scripts should be stored as workplace files.\nYou can prevent users from launching clusters using cluster-scoped init scripts stored on DBFS by setting a cluster policy.\nInstructions\nWarning\nYou must be a Databricks admin to apply cluster policies.\n1. Follow the documentation to\nCreate a cluster policy\n(\nAWS\n|\nAzure\n|\nGCP\n).\n2. Add the following to the cluster policy:\n\"init_scripts.*.dbfs.destination\": {\r\n \"type\": \"forbidden\"\r\n}\n3. Apply the policy to all users.\nAfter the policy has been applied, users will no longer be able to create clusters that use init scripts loaded from DBFS." 
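A sketch of what the cluster policy above forbids. The `violates_policy` helper is hypothetical and the `{"dbfs": {"destination": ...}}` shape follows the `init_scripts` field of the Databricks clusters API; real enforcement happens server-side once the policy is applied.

```python
def violates_policy(cluster_spec):
    """True if any init script in the spec uses a DBFS destination,
    i.e. matches the forbidden init_scripts.*.dbfs.destination rule."""
    for script in cluster_spec.get("init_scripts", []):
        if "destination" in script.get("dbfs", {}):
            return True
    return False

dbfs_spec = {"init_scripts": [{"dbfs": {"destination": "dbfs:/init/setup.sh"}}]}
workspace_spec = {"init_scripts": [{"workspace": {"destination": "/Users/someone/setup.sh"}}]}
print(violates_policy(dbfs_spec), violates_policy(workspace_spec))
```

Workspace-file init scripts, as in `workspace_spec`, remain allowed under this policy.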
+} \ No newline at end of file diff --git a/scraped_kb_articles/disable-or-restrict-access-to-the-foundation-model-apis.json b/scraped_kb_articles/disable-or-restrict-access-to-the-foundation-model-apis.json new file mode 100644 index 0000000000000000000000000000000000000000..8d326312175e521ce770e195f2be71972c5c3398 --- /dev/null +++ b/scraped_kb_articles/disable-or-restrict-access-to-the-foundation-model-apis.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/disable-or-restrict-access-to-the-foundation-model-apis", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nNow that Mosaic AI Model Serving supports Foundation Model APIs, users within the workspace can access and query advanced open models by default. However, there may be instances where you need to restrict access to certain serving endpoints to prevent unauthorized usage.\nCause\nThe default configuration grants all users within the workspace access to these models, which could lead to potential misuse or unintended access if not properly controlled.\nSolution\nTo prevent users from accessing or using a specific endpoint, administrators can effectively \"disable\" the endpoint by setting its rate limits to zero. Navigate to the endpoint's detailed settings page and adjust both the request and response limits to 0.\nImportant\nOnly a workspace administrator has the authority to modify these settings. Once the limits are set to zero, any attempt to query the endpoint will result in a 403 permission denied error, ensuring that access is blocked.\nAn unauthorized user will receive the following error message.\n{\"error_code\":\"PERMISSION_DENIED\",\"message\":\"PERMISSION_DENIED: The endpoint is disabled due to a rate limit set to 0.\"}\nFor more information regarding model serving endpoints, please refer to the\nModel serving with Databricks\n(\nAWS\n|\nAzure\n) documentation." 
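The gating behavior can be sketched as a toy model (`query_endpoint` is not a real Databricks API; the denial payload is the one quoted in this article):

```python
# With the request rate limit set to 0, every call is denied with the
# PERMISSION_DENIED payload; otherwise the request is served normally.
def query_endpoint(request_rate_limit, prompt):
    if request_rate_limit == 0:
        return {
            "error_code": "PERMISSION_DENIED",
            "message": "PERMISSION_DENIED: The endpoint is disabled due to a rate limit set to 0.",
        }
    return {"result": f"served: {prompt}"}

print(query_endpoint(0, "Hello")["error_code"])
print(query_endpoint(100, "Hello")["result"])
```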
+} \ No newline at end of file diff --git a/scraped_kb_articles/display-file-timestamp-details.json b/scraped_kb_articles/display-file-timestamp-details.json new file mode 100644 index 0000000000000000000000000000000000000000..31026e9a07071977036b27635d8af899a3bd3edd --- /dev/null +++ b/scraped_kb_articles/display-file-timestamp-details.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/display-file-timestamp-details", + "title": "Unknown Article Title", + "content": "In this article we show you how to display detailed timestamps, including the date and time when a file was created or modified.\nUse ls command\nThe simplest way to display file timestamps is to use the\nls -lt \ncommand in a bash shell.\nFor example, this sample command displays basic timestamps for files and directories in the\n/dbfs/\nfolder.\n%sh\r\n\r\nls -lt /dbfs/\nOutput:\ntotal 36\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 FileStore\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 databricks\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 databricks-datasets\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 databricks-results\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 ml\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 tmp\r\ndrwxrwxrwx 2 root root 4096 Jul  1 12:49 user\r\ndrwxrwxrwx 2 root root 4096 Jun  9  2020 dbfs\r\ndrwxrwxrwx 2 root root 4096 May 20  2020 local_disk0\nUse Python commands to display creation date and modification date\nThe\nls\ncommand is an easy way to display basic information. If you want more detailed timestamps, you should use Python API calls.\nFor example, this sample code uses\ndatetime\nfunctions to display the creation date and modified date of all listed files and directories in the\n/dbfs/\nfolder. 
Replace\n/dbfs/\nwith the full path to the files you want to display.\n%python\r\n\r\nimport os\r\nfrom datetime import datetime\r\npath = '/dbfs/'\r\nfdpaths = [path+\"/\"+fd for fd in os.listdir(path)]\r\nprint(\" file_path \" + \" create_date \" + \" modified_date \")\r\nfor fdpath in fdpaths:\r\n  statinfo = os.stat(fdpath)\r\n  create_date = datetime.fromtimestamp(statinfo.st_ctime)\r\n  modified_date = datetime.fromtimestamp(statinfo.st_mtime)\r\n  print(fdpath, create_date, modified_date)\nOutput:\nfile_path  create_date  modified_date\r\n/dbfs//FileStore 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//databricks 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//databricks-datasets 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//databricks-results 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//dbfs 2020-06-09 21:11:24 2020-06-09 21:11:24\r\n/dbfs//local_disk0 2020-05-20 22:32:05 2020-05-20 22:32:05\r\n/dbfs//ml 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//tmp 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730\r\n/dbfs//user 2021-07-01 12:49:45.264730 2021-07-01 12:49:45.264730" +} \ No newline at end of file diff --git a/scraped_kb_articles/display-not-show-microseconds.json b/scraped_kb_articles/display-not-show-microseconds.json new file mode 100644 index 0000000000000000000000000000000000000000..ad8cf401306d35f89b7e4f26647d0cdcc7fcccfd --- /dev/null +++ b/scraped_kb_articles/display-not-show-microseconds.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/display-not-show-microseconds", + "title": "Unknown Article Title", + "content": "Problem\nYou want to display a timestamp value with microsecond precision, but when you use\ndisplay()\nit does not show the value past milliseconds.\nFor example, this Apache Spark SQL\ndisplay()\ncommand:\n%python\r\n\r\ndisplay(spark.sql(\"select cast('2021-08-10T09:08:56.740436' as timestamp) as 
test\"))\nReturns a truncated value:\n2021-08-10T09:08:56.740+0000\nCause\nThe DataFrame is converted to HTML internally before the output is rendered.\nThis limits the displayed results to millisecond precision.\nIt does not affect the stored value.\nSolution\nYou should use\nshow()\ninstead of using\ndisplay()\n.\nFor example, this Apache Spark SQL\nshow()\ncommand:\n%sql\r\n\r\nspark.sql(\"select cast('2021-08-10T09:08:56.740436' as timestamp) as test\").show(truncate=False)\nReturns the correct value:\n2021-08-10 09:08:56.740436\nAs an alternative, you can create a second column and copy the value to the column as a string.\nAfter conversion to a string,\ndisplay()\nshows the full value." +} \ No newline at end of file diff --git a/scraped_kb_articles/display-null-as-nan.json b/scraped_kb_articles/display-null-as-nan.json new file mode 100644 index 0000000000000000000000000000000000000000..1863ee8866bd8907544dd091c40c50168095a80f --- /dev/null +++ b/scraped_kb_articles/display-null-as-nan.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/display-null-as-nan", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a table with null values in some columns. When you query the table using a select statement in Databricks, the null values appear as null.\nWhen you query the table using the same select statement in Databricks SQL, the null values appear as NaN.\n%sql\r\n\r\nselect * from default. where is null\nDatabricks\nDatabricks SQL\nCause\nNaN is short for not a number. This is how null values are displayed in Databricks SQL.\nSolution\nThis is not a problem. Databricks SQL is working as designed.\nThe representation of null values in Databricks SQL is different from the representation of null values in Databricks, but the data itself is not changed." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/dlt-pipeline-failing-with-concurrentdeletedeleteexception-error.json b/scraped_kb_articles/dlt-pipeline-failing-with-concurrentdeletedeleteexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..d18864387902418d8939f62efd87eba555719d8c --- /dev/null +++ b/scraped_kb_articles/dlt-pipeline-failing-with-concurrentdeletedeleteexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/dlt-pipeline-failing-with-concurrentdeletedeleteexception-error", + "title": "Unknown Article Title", + "content": "Problem\nYour DLT pipeline update fails with the following error while trying to execute the\napply_changes\nfunction.\nio.delta.exceptions.ConcurrentDeleteDeleteException: [DELTA_CONCURRENT_DELETE_DELETE] ConcurrentDeleteDeleteException: This transaction attempted to delete one or more files that were deleted (for example XXXX.snappy.parquet) by a concurrent update. Please try the operation again.\nCause\nThere is a write conflict between any of the\nMERGE\n,\nUPDATE\n, or\nDELETE\ncommands executed by the pipeline\napply_changes\nfunction and the\nOPTIMIZE with ZORDER\noperation, which runs as part of regular DLT maintenance.\nEven when deletion vectors (row-level concurrency) are enabled, they do not fully eliminate such conflicts. This is expected behavior. DLT performs maintenance tasks within 24 hours of a table being updated. By default, the system performs a full\nOPTIMIZE\noperation with\nZORDER BY\nif specified, followed by\nVACUUM\n.\nFor more information, review the\nIsolation levels and write conflicts on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nSet the table property\ndelta.enableRowTracking\nto\ntrue\nto enable row tracking on the table. 
This property helps reduce concurrency errors within DLT pipelines when an\nOPTIMIZE with ZORDER\noperation triggered by the DLT maintenance pipeline conflicts with\nMERGE\n,\nUPDATE\n, or\nDELETE\noperations.\nSpecify the config as a table property in the DLT table definition.\ndlt.create_streaming_table( name=\"\", \r\ntable_properties={ \"delta.enableRowTracking\": \"true\" })\r\n\r\ndlt.apply_changes()\nFor more information, review the\nUse row tracking for Delta tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/dlt-pipeline-fails-with-error-dlt-error-code-execution_service_startup_failure.json b/scraped_kb_articles/dlt-pipeline-fails-with-error-dlt-error-code-execution_service_startup_failure.json new file mode 100644 index 0000000000000000000000000000000000000000..ce0b8f6c6dd08fd532274e46a8194f1732237c3b --- /dev/null +++ b/scraped_kb_articles/dlt-pipeline-fails-with-error-dlt-error-code-execution_service_startup_failure.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/dlt-pipeline-fails-with-error-dlt-error-code-execution_service_startup_failure", + "title": "Unknown Article Title", + "content": "Problem\nWhen using a recently rotated, deleted, or expired service principal secret with Auto Loader or Delta Live Tables pipelines, you encounter an error message.\ncom.databricks.pipelines.common.CustomException: [DLT ERROR CODE: EXECUTION_SERVICE_STARTUP_FAILURE] HTTP Error 401; url='https://login.microsoftonline.com//oauth2/token' AADToken: HTTP connection to failed for getting token from AzureAD.; requestId=''; contentType='application/json; charset=utf-8'; response '{\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret provided. 
Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app ''.\"\nCause\nThe service principal secret was rotated, deleted, or expired, so the previously configured secret is no longer valid. This causes an authentication failure with Microsoft Entra ID (Azure Active Directory).\nSolution\nGenerate a new Microsoft Entra ID service principal secret in the Azure portal or Azure CLI.\nUpdate the secret in the Databricks secret scope using the Databricks CLI or UI.\nUnmount and remount the Azure storage mount points in the Databricks workspace using the new secret; otherwise, the updated secret will not be picked up.\nUse the\ndbutils.fs.refreshMounts()\ncommand in the DLT code to force the DLT cluster to pick up the latest configuration.\nFor more information, review the\nConnect to Azure Data Lake Storage Gen2 and Blob Storage\nand\nMounting cloud object storage on Azure Databricks\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/dlt-pipeline-is-very-slow-when-using-auto-loader-and-a-glob-filter.json b/scraped_kb_articles/dlt-pipeline-is-very-slow-when-using-auto-loader-and-a-glob-filter.json new file mode 100644 index 0000000000000000000000000000000000000000..4e3cb46c33a61f875c8ebd69cbb4e64b6f6d9ef7 --- /dev/null +++ b/scraped_kb_articles/dlt-pipeline-is-very-slow-when-using-auto-loader-and-a-glob-filter.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/dlt-pipeline-is-very-slow-when-using-auto-loader-and-a-glob-filter", + "title": "Unknown Article Title", + "content": "Problem\nA Delta Live Table (DLT) pipeline using Auto Loader appears to hang or process data very slowly when handling large datasets with a Glob filter in the file path.\nCause\nThis behavior arises because Auto Loader's directory listing mode processes files by first listing all objects in the specified path before applying the Glob filter. 
This approach requires the entire directory, including all subfolders, to be scanned, even if the Glob pattern excludes many of them.\nCloud providers do not allow filtering during the listing phase. As a result, Auto Loader must first discover all files and then apply the Glob filter to include or exclude files.\nWhen the directory contains millions of files, this listing process can take a significant amount of time and is executed in a single thread by default. This delay prevents timely updates to micro-batches. For example, if the source directory contained over 500 million files, excessive backfill times could occur as the directory listing process could not filter out unnecessary files early.\nSolution\nFile notification mode (recommended)\nDatabricks recommends enabling file notification mode when using Auto Loader. This mode bypasses the need for directory scanning by leveraging cloud-native event notifications to detect new files and eliminates directory listing delays.\nReview the\nWhat is Auto Loader file notification mode?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to learn how to configure file notification mode.\nDirectory listing mode\nIf file notification mode is not feasible, consider the following mitigations to improve performance in directory listing mode:\nDisable asynchronous directory listing for backfilling.\nThe Apache Spark configuration\nspark.databricks.cloudFiles.asyncDirListing\ndetermines how backfilling operations handle directory listings in Databricks. Disabling this configuration (\nspark.databricks.cloudFiles.asyncDirListing = false\n) can distribute the listing tasks across executors, potentially speeding up the listing process in scenarios where backfilling is taking too long.\nYou will have to measure the tradeoff in terms of performance with your specific workloads to determine if this is indeed beneficial.\nPartition your source directories. 
Instead of applying a broad Glob filter over a large directory tree, divide the data into smaller, more manageable partitions.\nFor example:\nCreate separate streams for each specific path or time-based partition.\nConsolidate data from multiple streams after processing.\nPeriodically archive processed files. Move or archive processed files to another location to reduce the size of the source directory and improve listing times for future runs.\nThese steps ensure that micro-batches are processed efficiently, reducing delays and avoiding the appearance of a \"stuck\" pipeline." +} \ No newline at end of file diff --git a/scraped_kb_articles/dlt-table-names-auto-renaming-with-schema-id-prefix-and-not-accessible.json b/scraped_kb_articles/dlt-table-names-auto-renaming-with-schema-id-prefix-and-not-accessible.json new file mode 100644 index 0000000000000000000000000000000000000000..6ea884e13791fcb9520e43a44a59e05b60f98d87 --- /dev/null +++ b/scraped_kb_articles/dlt-table-names-auto-renaming-with-schema-id-prefix-and-not-accessible.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/dlt-table-names-auto-renaming-with-schema-id-prefix-and-not-accessible", + "title": "Unknown Article Title", + "content": "Problem\nYou are unable to access your original Delta Live Table (DLT) tables.\nWhen you navigate to the\nCatalog\ntab of the UI, you notice the message\n“Table does not exist”\nis displayed.\nWhen you check the DLT UI (in\nJobs & Pipelines\n) you see the table names are automatically changing from\n\nto a format prefixed with the schema ID, such as\n___\n.\nCause\nIn your DLT pipeline configuration, you have the\ntemporary\nparameter set to\nTrue\n.\nWhen\ntemporary\nis set to\nTrue\n, DLT treats the tables as temporary assets, which are not intended for access outside the pipeline.\nSolution\nIn a notebook, you can set the\ntemporary\nvariable option to\nFalse\nwhile calling the DLT decorator.\n@(\r\n    
temporary=False\r\n)\r\n\nAlternatively, also in a notebook, you can remove the temporary variable option while calling the DLT decorator.\n@\r\n\r\n" +} \ No newline at end of file diff --git a/scraped_kb_articles/dltimportexception-error-when-importing-the-dlt-module.json b/scraped_kb_articles/dltimportexception-error-when-importing-the-dlt-module.json new file mode 100644 index 0000000000000000000000000000000000000000..3e449c258697c57a565c765a86163858be459be3 --- /dev/null +++ b/scraped_kb_articles/dltimportexception-error-when-importing-the-dlt-module.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/dltimportexception-error-when-importing-the-dlt-module", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to import the Delta Live Tables (DLT) module, you encounter the following error.\n“DLTImportException: Delta Live Tables module is not supported on Spark Connect clusters.”\nCause\nThe file\n/databricks/python_shell/dbruntime/PostImportHook.py\nin the Databricks environment overrides all\ndlt import\nstatements to support Delta Live Tables. However, this creates a naming conflict because\ndlt\nis also the name of an open-source PyPI package that certain Python libraries depend on. The issue arises because the Databricks Runtime import hook bypasses the try/except block typically used by these libraries to handle imports gracefully, leading to an import conflict.\nSolution\nYou can address this issue by using a cluster-scoped init script targeting a specific job or cell commands in a notebook.\nUse an init script\nUse the workspace file browser to\ncreate a new file\n(\nAWS\n|\nAzure\n|\nGCP\n) in your home directory. 
Call it\nremovedbdlt.sh\n.\nOpen the\nremovedbdlt.sh\nfile.\nCopy and paste this init script into\nremovedbdlt.sh\n.\n#!/bin/bash\r\nrm -rf /databricks/spark/python/dlt\nFollow the documentation to\nconfigure a cluster-scoped init script\n(\nAWS\n|\nAzure\n|\nGCP\n) as a workspace file.\nSpecify the path to the init script. Since you created\nremovedbdlt.sh\nin your home directory, the path should look like\n/Users//removedbdlt.sh\n.\nAfter configuring the init script, restart the cluster.\nUse cell commands\nRun the following commands in a notebook cell.\n%python\r\nimport site\r\nimport sys\r\nsys.path.insert(0, site.getsitepackages()[0])\r\n%pip install dlt\r\nimport dlt\r\ndlt.__version__" +} \ No newline at end of file diff --git a/scraped_kb_articles/dns-resolution-fails-for-a-newly-created-databricks-workspace.json b/scraped_kb_articles/dns-resolution-fails-for-a-newly-created-databricks-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..8ee575cc82a586c71a940c74a8b2ec421a1434df --- /dev/null +++ b/scraped_kb_articles/dns-resolution-fails-for-a-newly-created-databricks-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/dns-resolution-fails-for-a-newly-created-databricks-workspace", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/double-data-type-defined-columns-lose-precision-for-values-exceeding-15-significant-digits.json b/scraped_kb_articles/double-data-type-defined-columns-lose-precision-for-values-exceeding-15-significant-digits.json new file mode 100644 index 0000000000000000000000000000000000000000..cd1fdbb9e000422a93209da1f03fb55e63266117 --- /dev/null +++ b/scraped_kb_articles/double-data-type-defined-columns-lose-precision-for-values-exceeding-15-significant-digits.json @@ -0,0 +1,5 @@ +{ + "url": 
"https://kb.databricks.com/en_US/dbsql/double-data-type-defined-columns-lose-precision-for-values-exceeding-15-significant-digits", + "title": "Unknown Article Title", + "content": "Problem\nYou are using a Databricks SQL environment where columns defined with the\nDOUBLE\ndata type lose precision when storing or comparing numeric values that exceed 15 significant digits. As a result, numeric values are silently rounded or truncated, leading to incorrect data storage, inaccurate query results, and unexpected behavior in equality comparisons or joins.\nCause\nThe\nDOUBLE\ndata type in Databricks SQL is based on the IEEE 754 double-precision floating-point standard, which uses 64 bits to represent a numeric value. Of these 64 bits, only 53 are used to represent the significant digits, which limits the precision to approximately 15 significant decimal digits based on the numeric value.\nThis means that when you store a numeric value with more than 15 digits—especially large integers or high-precision identifiers—the\nDOUBLE\ntype rounds the value to the nearest representable binary approximation.\nSolution\nSwitch the data type from\nDOUBLE\nto\nDECIMAL\nto allow the table to store and query values with higher precision.\nFor example, changing the column to\nDECIMAL(38,0)\nallows the table to store and query values with up to 38 digits of precision. This ensures that the complete digits are preserved and displayed accurately.\nTo change the data type of the column:\n1. Connect to your Databricks SQL environment.\n2. Locate the affected Delta table and identify the column with the\nDOUBLE\ndata type.\n3. Run the following SQL command to alter the column data type to\nDECIMAL(p,s)\nwhere\np\nand\ns\nare the precision and scale required for the application.\n```sql\r\nALTER TABLE .\r\nALTER COLUMN id TYPE DECIMAL(p,s); -- p and s are the required precision and scale for the application\r\n```\n4. 
Verify the change by querying the table and checking the data type of the column." +} \ No newline at end of file diff --git a/scraped_kb_articles/download-files-from-dbfs-with-the-web-browser.json b/scraped_kb_articles/download-files-from-dbfs-with-the-web-browser.json new file mode 100644 index 0000000000000000000000000000000000000000..f6e54d42dc63a66be6e6fd6dfb7f362061a01dd4 --- /dev/null +++ b/scraped_kb_articles/download-files-from-dbfs-with-the-web-browser.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/download-files-from-dbfs-with-the-web-browser", + "title": "Unknown Article Title", + "content": "Problem\nYou want to download files stored on DBFS root directly from the web. This can be useful when working with files like logs that are being generated and stored in DBFS, such as log delivery, init script logs, or tcp dumps.\nCause\nYou cannot download files directly from DBFS if\nDBFS File Browser\nis disabled in the workspace admin settings.\nSolution\nTo download files from DBFS via the web browser:\nVerify that a workspace admin has enabled DBFS File Browser. Follow the steps in the\nManage the DBFS file browser\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nClick\nCatalog\nin the left nav bar.\nClick\nBrowse DBFS\n.\nGo to the location where the file is located.\nRight-click the file you want to download.\nClick\nMove\n.\nMove the file to the\nFileStore\nfolder.\nOnce the file is in the FileStore folder, you can access it with a custom URL. Enter the following URL into your browser, where\n\n(\nAWS\n|\nAzure\n|\nGCP\n) is the Databricks workspace URL and\n\nis the name of the file you want to download:\nhttps:///files/\nSubmit the URL to download the file." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/driver-unavailable.json b/scraped_kb_articles/driver-unavailable.json new file mode 100644 index 0000000000000000000000000000000000000000..1ffe3c0f2121db9ddd198f9f33a6bac43180b437 --- /dev/null +++ b/scraped_kb_articles/driver-unavailable.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/driver-unavailable", + "title": "Unknown Article Title", + "content": "Problem\nWhen running notebooks or jobs on a cluster, they run successfully multiple times, but sometimes the driver stops working and error messages will display, such as:\nDriver is temporarily unavailable.\nThe spark driver has stopped unexpectedly and is restarting.\nLost connection to cluster. The notebook may have been detached.\nIf you check the\ncluster event logs\n, you'll find the\nDriver_Not_Responding\nevent with a message related to garbage collection (GC):\nDriver is up but is not responsive, likely due to GC.\nIf you check the\nGanglia metrics\nwhen the issue occurs, you will notice that the driver node is experiencing a high load (for example, showing an orange/red color).\nTo get the driver’s IP so that you can filter on it in the Ganglia metrics dashboard, you can navigate to the cluster’s Spark cluster UI > Master tab and get the IP of the driver (the Spark Master) from the first line: `Spark Master at spark://x.x.x.x:port`.\nCause\nOne common cause for this error is that the driver is undergoing a memory bottleneck. When this happens, the driver crashes with an out of memory (OOM) condition and gets restarted or becomes unresponsive due to frequent full garbage collection. The reason for the memory bottleneck can be any of the following:\nThe driver instance type is not optimal for the load executed on the driver.\nThere are memory-intensive operations executed on the driver.\nThere are many notebooks or jobs running in parallel on the same cluster.\nSolution\nThe solution varies from case to case. 
The easiest way to resolve the issue in the absence of specific details is to increase the driver memory. You can increase driver memory simply by upgrading the driver node type on the cluster edit page in your Databricks workspace.\nOther points to consider:\nAvoid memory-intensive operations like:\ncollect() operator, which brings a large amount of data to the driver.\nConversion of a large DataFrame to Pandas DataFrame using the toPandas() function.\nIf these operations are essential, ensure that enough driver memory is available; otherwise, look for alternatives that can parallelize the execution of your code. For example, use Spark instead of Pandas for data processing, and Spark ML instead of regular Python machine-learning libraries (for example, scikit-learn).\nAvoid running batch jobs on a shared interactive cluster.\nDistribute the workloads into different clusters. No matter how big the cluster is, the functionalities of the Spark driver cannot be distributed within a cluster.\nDo a periodic restart of the interactive cluster (on a daily basis, for example) during low loads to clear out any remaining objects in the memory from previous runs. You can use the cluster's\nrestart REST API endpoint\nalong with your favorite automation tool to automate this.\nRun the specific notebook in isolation on a cluster to evaluate exactly how much memory is required to execute the notebook successfully." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/drop-database-no-delete.json b/scraped_kb_articles/drop-database-no-delete.json new file mode 100644 index 0000000000000000000000000000000000000000..0011ef0faa1481c5f31f6f97b865898d0be24d68 --- /dev/null +++ b/scraped_kb_articles/drop-database-no-delete.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/drop-database-no-delete", + "title": "Unknown Article Title", + "content": "By default, the\nDROP DATABASE\n(\nAWS\n|\nAzure\n|\nGCP\n) command drops the database and deletes the directory associated with the database from the file system.\nSometimes you may want to drop the database, but keep the underlying database directory intact.\nExample code\nYou can use this example code to drop the database without dropping the underlying storage folder.\n%scala\r\n\r\nimport scala.collection.JavaConverters._\r\nimport org.apache.hadoop.hive.ql.metadata.Hive\r\nimport org.apache.hadoop.hive.conf.HiveConf\r\nimport org.apache.hadoop.hive.ql.session.SessionState\r\n\r\nval hiveConf = new HiveConf(classOf[SessionState])\r\nsc.hadoopConfiguration.iterator().asScala.foreach { kv =>\r\nhiveConf.set(kv.getKey, kv.getValue)\r\n}\r\nsc.getConf.getAll.foreach {\r\ncase (k, v) => hiveConf.set(k, v)\r\n}\r\n\r\nhiveConf.setBoolean(\"hive.cbo.enable\", false)\r\nval state = new SessionState(hiveConf)\r\nval hive = Hive.get(state.getConf)\r\nprintln(state.getConf)\r\n\r\nhive.dropDatabase(\"\", false, false, true)\nFor more information on\norg.apache.hadoop.hive.ql.metadata.Hive\n, please review the\nHive documentation\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/drop-delta-table.json b/scraped_kb_articles/drop-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..a1a7dea26c111cd0ce0f9f5e900db779b111634a --- /dev/null +++ b/scraped_kb_articles/drop-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/drop-delta-table", + "title": "Unknown Article Title", + "content": "Regardless of how you drop a managed table, it can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files. If a Delta table has been in use for a long time, it can accumulate a very large amount of data.\nIn the Databricks environment, there are two ways to drop tables (\nAWS\n|\nAzure\n|\nGCP\n):\nRun\nDROP TABLE\nin a notebook cell.\nClick\nDelete\nin the UI.\nEven though you can delete tables in the background without affecting workloads, it is always good to make sure that you run\nDELETE FROM\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nVACUUM\n(\nAWS\n|\nAzure\n|\nGCP\n) before you start a drop command on any table. This ensures that the metadata and file sizes are cleaned up before you initiate the actual data deletion.\nFor example, if you are trying to delete the Delta table\nevents\n, run the following commands before you start the\nDROP TABLE\ncommand:\nRun DELETE FROM:\nDELETE FROM events\nRun VACUUM with an interval of zero:\nVACUUM events RETAIN 0 HOURS\nThese two steps reduce the amount of metadata and number of uncommitted files that would otherwise increase the data deletion time." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/drop-table-corruptedmetadata.json b/scraped_kb_articles/drop-table-corruptedmetadata.json new file mode 100644 index 0000000000000000000000000000000000000000..5da1e19180300b96b800333170ea30a5ea9a5980 --- /dev/null +++ b/scraped_kb_articles/drop-table-corruptedmetadata.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/drop-table-corruptedmetadata", + "title": "Unknown Article Title", + "content": "Problem\nSometimes you cannot drop a table from the Databricks UI. Using\n%sql\nor\nspark.sql\nto drop the table doesn’t work either.\nCause\nThe metadata (table schema) stored in the metastore is corrupted. When you run the\nDrop table\ncommand, Spark checks whether the table exists before dropping it. Since the metadata for the table is corrupted, Spark can’t drop the table and fails with the following exception.\n%scala\r\n\r\ncom.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: The metadata is corrupted\nSolution\nUse a Hive client to drop the table since the Hive client doesn’t check for table existence as Spark does. 
To drop a table:\nCreate a function inside the Hive package.\n%scala\r\n\r\npackage org.apache.spark.sql.hive {\r\nimport org.apache.spark.sql.hive.HiveUtils\r\nimport org.apache.spark.SparkContext\r\n\r\nobject utils {\r\n    def dropTable(sc: SparkContext, dbName: String, tableName: String, ignoreIfNotExists: Boolean, purge: Boolean): Unit = {\r\n      HiveUtils\r\n          .newClientForMetadata(sc.getConf, sc.hadoopConfiguration)\r\n          .dropTable(dbName, tableName, ignoreIfNotExists, false)\r\n    }\r\n  }\r\n}\nDrop corrupted tables.\n%scala\r\n\r\nimport org.apache.spark.sql.hive.utils\r\nutils.dropTable(sc, \"default\", \"my_table\", true, true)" +} \ No newline at end of file diff --git a/scraped_kb_articles/drop-table-exception-azure-metastore.json b/scraped_kb_articles/drop-table-exception-azure-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..240629cb3139b617a17c93f213db20a9206abc2e --- /dev/null +++ b/scraped_kb_articles/drop-table-exception-azure-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/drop-table-exception-azure-metastore", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to drop a table in an external Hive version 2.0 or 2.1 metastore that is deployed on Azure SQL Database, Databricks throws the following exception:\ncom.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Exception thrown when executing query : SELECT 'org.apache.hadoop.hive.metastore.model.MStorageDescriptor' AS NUCLEUS_TYPE,A0.INPUT_FORMAT,A0.IS_COMPRESSED,A0.IS_STOREDASSUBDIRECTORIES,A0.LOCATION,A0.NUM_BUCKETS,A0.OUTPUT_FORMAT,A0.SD_ID FROM SDS A0 WHERE A0.CD_ID = ? 
OFFSET 0 ROWS FETCH NEXT ROW ONLY );\r\n  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:107)\r\n  at org.apache.spark.sql.hive.HiveExternalCatalog.doDropTable(HiveExternalCatalog.scala:483)\r\n  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.dropTable(ExternalCatalog.scala:122)\r\n  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropTable(SessionCatalog.scala:638)\r\n  at org.apache.spark.sql.execution.command.DropTableCommand.run(ddl.scala:212)\nCause\nThis is a known Hive bug (\nHIVE-14698\n), caused by a known bug in the\ndatanucleus-rdbms\nmodule. The bug is fixed in\ndatanucleus-rdbms\n4.1.16, but Hive 2.0 and 2.1 metastores ship version 4.1.7, which is affected.\nSolution\nDo one of the following:\nUpgrade the Hive metastore to version 2.3.0. This also resolves problems due to any other Hive bug that is fixed in version 2.3.0.\nImport the following notebook to your workspace and follow the instructions to replace the\ndatanucleus-rdbms\nJAR. This notebook is written to upgrade the metastore to version 2.1.1. You may want to use a similar version on your server side.\nReview the\nExternal metastore upgrade notebook\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/dropping-and-recreating-delta-tables-results-in-a-deltaversionsnotcontiguousexception-error.json b/scraped_kb_articles/dropping-and-recreating-delta-tables-results-in-a-deltaversionsnotcontiguousexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..cf3b0cae3fed293721e3a04ab1b254f690bdd15f --- /dev/null +++ b/scraped_kb_articles/dropping-and-recreating-delta-tables-results-in-a-deltaversionsnotcontiguousexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/dropping-and-recreating-delta-tables-results-in-a-deltaversionsnotcontiguousexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Warning\nThis article does NOT apply to managed tables. Managed table operations should always be performed through Unity Catalog. For more information, review the\nWork with managed tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nProblem\nWhen working with Delta Lake tables you encounter an error message.\n\"java.lang.IllegalStateException: Versions (Vector(0, 3)) are not contiguous\" typically occurs when there's a lack of continuity in the Delta log files, which can happen when files have been manually removed or due to S3 eventual consistency when a table is deleted and recreated at the same location.\nCause\nThe Delta table is corrupted. 
Corrupted tables can occur if you:\nManually remove underlying files from the Delta log.\nRun rm commands or other non-Delta operations that remove files from the Delta log.\nDrop and immediately create a table on top of the same location.\nWhen files are manually removed or not removed correctly, the Delta log versions become non-contiguous.\nSolution\nUse the following Scala code to get the corrupted table storage location.\n%scala\r\nval metastore = spark.sharedState.externalCatalog\r\nval location = metastore.getTable(\"\", \"\").location\nRemove the table base folder.\n%sh\r\nrm -r /dbfs/\nDrop the table.\n%sql\r\ndrop table \nUse\nCREATE OR REPLACE TABLE\nto recreate the table if the underlying location remains the same. For more information, please review the\nCREATE TABLE [USING]\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAlternatively, if you only have append operations (without\nOPTIMIZE\n) on your Delta table, retrieve the Delta table without dropping it and convert it.\nRemove the\n_delta_log\nfolder. Without\n_delta_log\n, this will be treated as a parquet table.\n%sh\r\nrm -r /dbfs//_delta_log\nConvert this parquet table to Delta table. The following command will create a fresh\n_delta_log\nfolder so the table can be queryable without losing the data.\n%sql\r\nCONVERT TO DELTA parquet.`/` PARTITIONED BY (year string) [if table is partitioned use PARTITIONED BY];\nFor more information on best practices for dropping Delta Lake tables, please review the\nBest practices for dropping a managed Delta Lake table\nknowledge base article." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/ds_resource_not_found_on_ds_server-error-when-a-data-recipient-tries-to-access-views-from-a-dedicated-cluster.json b/scraped_kb_articles/ds_resource_not_found_on_ds_server-error-when-a-data-recipient-tries-to-access-views-from-a-dedicated-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..59f62611b81d04421625130e480291c1d2509473 --- /dev/null +++ b/scraped_kb_articles/ds_resource_not_found_on_ds_server-error-when-a-data-recipient-tries-to-access-views-from-a-dedicated-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/ds_resource_not_found_on_ds_server-error-when-a-data-recipient-tries-to-access-views-from-a-dedicated-cluster", + "title": "Unknown Article Title", + "content": "Problem\nAs a data recipient in Delta Sharing, you are able to access both tables and views from a SQL warehouse or a standard cluster, but you notice you can only access the tables from a dedicated cluster, not the views. You receive the following error.\nHTTP request failed with status: HTTP/1.1 404 Not Found for query(Some(307774b4)).  {\"error_code\":\"NOT_FOUND\",\"message\":\"DS_RESOURCE_NOT_FOUND_ON_DS_SERVER: Resource not found on sharing server\\nEndpoint:\nhttps:///api/2.0/delta-sharing/metastores//shares//schemas//tables//query\n\\nMethod: POST\\nHTTP Code: 404\\nStatus Line: 404\\nBody: {\\\"error_code\\\":\\\"TABLE_DOES_NOT_EXIST\\\",\\\"message\\\":\\\"Table '..__dsff_materialization__XXXXXX' does not exist.\nCause\nThe materialization of the views occurs on the workspace where the recipient was created, but the recipient does not have access to the underlying share due to catalog bindings.\nThis behavior discrepancy is due to the way Delta Sharing handles materialization and access control for shared catalogs.\nSolution\nCreate a TOKEN recipient on the data provider side. 
Navigate to the\nDelta Sharing\npage >\nShared by me\ntab >\nRecipients\n>\nNew recipient\n>\nCreate Open recipient with Token authentication\n.\nShare the TOKEN profile with the recipient. Use the activation link generated from Step 1 to download the TOKEN profile (credential file).\nHave the recipient create a data provider using the TOKEN profile. On the Delta Sharing providers page, under the\nShared with me\ntab, click the\nImport data\nbutton on the right-hand side.\nMount the catalog using the share." +} \ No newline at end of file diff --git a/scraped_kb_articles/dstream-not-supported.json b/scraped_kb_articles/dstream-not-supported.json new file mode 100644 index 0000000000000000000000000000000000000000..2a79ead89b8340b6b77ca45ae700cf52c2e079d2 --- /dev/null +++ b/scraped_kb_articles/dstream-not-supported.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/dstream-not-supported", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to use a Spark Discretized Stream (DStream) in a Databricks streaming job, but the job is failing.\nCause\nDStreams and the DStream API are not supported by Databricks.\nSolution\nInstead of using Spark DStream, you should migrate to Structured Streaming.\nReview the Databricks Structured Streaming in production (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/dump-table.json b/scraped_kb_articles/dump-table.json new file mode 100644 index 0000000000000000000000000000000000000000..01c339adcd7efa014101c5dad2bcbd015cecec4a --- /dev/null +++ b/scraped_kb_articles/dump-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/dump-table", + "title": "Unknown Article Title", + "content": "You want to send the results of your computations in Databricks to a system outside Databricks. 
You can use BI tools to connect to your cluster via JDBC and export results from the BI tools, or save your tables in DBFS or blob storage and copy the data via REST API.\nThis article introduces JSpark, a simple console tool for executing SQL queries using JDBC on Spark clusters to dump remote tables to local disk in CSV, JSON, XML, Text, and HTML formats.\nFor example:\n%sh\r\n\r\njava -Dconfig.file=mycluster.conf -jar jspark.jar -q \"select id, type, priority, status from tickets limit 5\"\nreturns:\n+----+--------+--------+------+\r\n|  id|type    |priority|status|\r\n+----+--------+--------+------+\r\n|9120|problem |urgent  |closed|\r\n|9121|question|normal  |hold  |\r\n|9122|incident|normal  |closed|\r\n|9123|question|normal  |open  |\r\n|9124|incident|normal  |solved|\r\n+----+--------+--------+------+\nInstructions for use, example usage, source code, and a link to the assembled JAR are available at the\nJSpark GitHub repo\n.\nYou can specify the JDBC connection parameters using arguments or using a config file, for example:\nmycluster.conf\n.\nTo check or troubleshoot JDBC connections, download the fat JAR\njspark.jar\nand launch it as a regular JAR. It includes hive-jdbc 1.2.1 and all required dependencies." +} \ No newline at end of file diff --git a/scraped_kb_articles/dupe-column-in-metadata.json b/scraped_kb_articles/dupe-column-in-metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..915c92ca7daf3b3793757833d93cb3c4af448d75 --- /dev/null +++ b/scraped_kb_articles/dupe-column-in-metadata.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/dupe-column-in-metadata", + "title": "Unknown Article Title", + "content": "Problem\nYour Apache Spark job is processing a Delta table when the job fails with an error message.\norg.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2...\nCause\nThere are duplicate column names in the Delta table. 
Column names that differ only by case are considered duplicate.\nDelta Lake is case preserving, but case insensitive, when storing a schema.\nParquet is case sensitive when storing and returning column information.\nSpark can be case sensitive, but it is case insensitive by default.\nIn order to avoid potential data corruption or data loss, duplicate column names are not allowed.\nSolution\nDelta tables must not contain duplicate column names.\nEnsure that all column names are unique." +} \ No newline at end of file diff --git a/scraped_kb_articles/duplicate-tag-keys-are-not-allowed-when-updating-table-tag-keys.json b/scraped_kb_articles/duplicate-tag-keys-are-not-allowed-when-updating-table-tag-keys.json new file mode 100644 index 0000000000000000000000000000000000000000..4dc56ec1fbe0725e4bb0d04dcf48c8e7fbe1f233 --- /dev/null +++ b/scraped_kb_articles/duplicate-tag-keys-are-not-allowed-when-updating-table-tag-keys.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/duplicate-tag-keys-are-not-allowed-when-updating-table-tag-keys", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAs of November 19, 2024, you notice your jobs are setting a tag for a specific table according to the following structure.\nALTER TABLE customer.profile.activity_by_guid \r\nSET TAGS('LOB' = 'Non-Specific')\nThis same\nALTER TABLE\nworked without issue before November 19. 
When you check your driver logs, you see the following exception.\n2024/11/19 19:06:16.270 WARN com.databricks.elasticspark.runoutput.PersistedRunOutputHelper$[tenant=XXXX traceId=XXXX spanId=XXXX requestId=XXXX serverEventId=XXXX parentServerEventId=XXXX thread=jobsApiStorageThreadPool-47]: Failed to parse [RequestId=XXXX ErrorClass=INVALID_PARAMETER_VALUE.INVALID_PARAMETER_VALUE] Duplicate tag keys are not allowed\r\nJVM stacktrace:\r\ncom.databricks.sql.managedcatalog.UnityCatalogServiceException\nCause\nDatabricks started enabling case sensitivity in workspaces as of November 19, 2024. All new tag keys created need to be lower case. All existing tags need to be converted to lower case.\nThis policy change is why\n“LOB”\nfails with the\nSET TAGS\ncommand, even though\n“LOB”\nexists on the table.\nSolution\nUse lower case when setting tag keys.\nALTER TABLE customer.profile.activity_by_guid \r\nSET TAGS ('lob' = 'Non-Specific')" +} \ No newline at end of file diff --git a/scraped_kb_articles/duplicates-appearing-in-auto-loader-with-file-notification-feature-despite-set-backfill-interval.json b/scraped_kb_articles/duplicates-appearing-in-auto-loader-with-file-notification-feature-despite-set-backfill-interval.json new file mode 100644 index 0000000000000000000000000000000000000000..28e127203ebc36c28479e48b06beaff69efb3427 --- /dev/null +++ b/scraped_kb_articles/duplicates-appearing-in-auto-loader-with-file-notification-feature-despite-set-backfill-interval.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/duplicates-appearing-in-auto-loader-with-file-notification-feature-despite-set-backfill-interval", + "title": "Unknown Article Title", + "content": "Problem\nWhen working with Auto Loader in file notification mode to ingest files, you notice the cluster receives a significantly higher number of messages during certain intervals, leading to duplicate data in the resulting dataset.\nThis behavior is observed despite a set backfill 
interval and smooth incoming traffic.\nCause\nUsing\nBackfillInterval\n, notification mode, and\nallowOverwrites\nin combination is known to cause duplicates.\nIt’s also possible the same file is being repeatedly processed and overwriting existing entries in\ncloud_files_state\n, which also leads to duplicates.\nSolution\nRemove the\nallowOverwrites\nconfiguration or implement deduplication logic downstream if you want to reprocess overwritten files at the source.\nUse\ncloud_files_state\nto identify any files that have been processed more than once.\nFor more information on\ncloud_files_state\n, review the\ncloud_files_state\ntable-valued function\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/edited-policy-not-applied.json b/scraped_kb_articles/edited-policy-not-applied.json new file mode 100644 index 0000000000000000000000000000000000000000..303fef4dc5fd98b7871352e4bc6d99e8fb8a3285 --- /dev/null +++ b/scraped_kb_articles/edited-policy-not-applied.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/edited-policy-not-applied", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to update an existing\ncluster policy\n; however, the update does not apply to the cluster associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not applied or saved.\nCause\nThis is a known issue that is being addressed.\nSolution\nYou can use a workaround until a permanent fix is available.\nEdit the cluster policy.\nRe-attribute the policy to\nFree form\n.\nAdd the edited policy back to the cluster.\nIf you want to edit a cluster that is associated with a policy:\nTerminate the cluster.\nAssociate a different policy to the cluster.\nEdit the cluster.\nRe-associate the original policy to the cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/empty-string-values-convert-to-null-values-when-saving-a-table-as-csv-or-text-based-file-format.json b/scraped_kb_articles/empty-string-values-convert-to-null-values-when-saving-a-table-as-csv-or-text-based-file-format.json new file mode 100644 index 0000000000000000000000000000000000000000..e3b0b4e5fe4e7f68f0cfd069055513be6395da18 --- /dev/null +++ b/scraped_kb_articles/empty-string-values-convert-to-null-values-when-saving-a-table-as-csv-or-text-based-file-format.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/empty-string-values-convert-to-null-values-when-saving-a-table-as-csv-or-text-based-file-format", + "title": "Unknown Article Title", + "content": "Problem\nWhen saving a table as a CSV file or other text-based format, your empty string values are replaced with\nNULL\nvalues.\nCause\nEmpty string values are interpreted as\nNULL\nduring the Serialization/Deserialization (SerDe) step when saving tables as CSV files or other text-based data formats, which do not have a defined schema present.\nNULL\nvalues may also appear when performing joins on tables if the join condition is not met.\nSolution\nSave your data in Delta format instead of CSV or text-based formats. Delta tables handle empty strings and\nNULL\nvalues more effectively, ensuring that empty strings are preserved during data insertion.\nIf you need to use CSV format, ensure:\nThe options being used to both read and write can correctly handle empty /\nNULL\nvalues.\nAny external systems reading the file can properly serialize the data as intended.\nFor more information, please review the Apache Spark\nCSV Files\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/enable-gcm-cipher.json b/scraped_kb_articles/enable-gcm-cipher.json new file mode 100644 index 0000000000000000000000000000000000000000..20812506b7c5acca0b0074601ec635df5cc43539 --- /dev/null +++ b/scraped_kb_articles/enable-gcm-cipher.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/enable-gcm-cipher", + "title": "Unknown Article Title", + "content": "Databricks clusters using Databricks Runtime 9.1 LTS and below do not have GCM (Galois/Counter Mode) cipher suites enabled by default.\nYou must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites.\nInfo\nThis article applies to clusters using Databricks Runtime 7.3 LTS and 9.1 LTS. Databricks Runtime 10.4 LTS and above have GCM cipher suites enabled by default.\nVerify required cipher suites\nUse the\nnmap\nutility to verify which cipher suites are required by the external server.\n%sh\r\n\r\nnmap --script ssl-enum-ciphers -p \nNote\nIf\nnmap\nis not installed, run\nsudo apt-get install -y nmap\nto install it on your cluster.\nCreate an init script to enable GCM cipher suites\nUse the example code to create an init script that enables GCM cipher suites on your cluster.\n%python\r\n\r\ndbutils.fs.put(\"//enable-gcm.sh\", \"\"\"#!/bin/bash\r\nsed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security\r\n\"\"\",True)\n%scala\r\n\r\ndbutils.fs.put(\"//enable-gcm.sh\", \"\"\"#!/bin/bash\r\nsed -i 's/, GCM//g' /databricks/spark/dbconf/java/extra.security\r\n\"\"\",true)\nRemember the path to the init script. 
You will need it when configuring your cluster.\nConfigure cluster with init script\nFollow the documentation to configure a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nYou must specify the path to the init script.\nAfter configuring the init script, restart the cluster.\nVerify that GCM cipher suites are enabled\nThis example code queries the cluster for all supported cipher suites and then prints the output.\n%scala\r\n\r\nimport java.util.Map;\r\nimport java.util.TreeMap;\r\nimport javax.net.ssl.SSLServerSocketFactory\r\nimport javax.net.ssl._\r\nSSLContext.getDefault.getDefaultSSLParameters.getProtocols.foreach(println)\r\nSSLContext.getDefault.getDefaultSSLParameters.getCipherSuites.foreach(println)\nIf the GCM cipher suites are enabled, you see the following AES-GCM ciphers listed in the output.\nTLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384\r\nTLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256\r\nTLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384\r\nTLS_RSA_WITH_AES_256_GCM_SHA384\r\nTLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384\r\nTLS_ECDH_RSA_WITH_AES_256_GCM_SHA384\r\nTLS_DHE_RSA_WITH_AES_256_GCM_SHA384\r\nTLS_DHE_DSS_WITH_AES_256_GCM_SHA384\r\nTLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256\r\nTLS_RSA_WITH_AES_128_GCM_SHA256\r\nTLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256\r\nTLS_ECDH_RSA_WITH_AES_128_GCM_SHA256\r\nTLS_DHE_RSA_WITH_AES_128_GCM_SHA256\r\nTLS_DHE_DSS_WITH_AES_128_GCM_SHA256\nConnect to the external server\nOnce you have verified that GCM cipher suites are installed on your cluster, make a connection to the external server." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/enable-openjsse-tls13.json b/scraped_kb_articles/enable-openjsse-tls13.json new file mode 100644 index 0000000000000000000000000000000000000000..8a200753339ff5eabdb8441991423039d0ed09c7 --- /dev/null +++ b/scraped_kb_articles/enable-openjsse-tls13.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/enable-openjsse-tls13", + "title": "Unknown Article Title", + "content": "Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged between worker nodes in a cluster is not encrypted.\nIf you require that data be encrypted at all times, you can\nencrypt traffic between cluster worker nodes\nusing AES 128 over a TLS 1.2 connection.\nIn some cases, you may want to use TLS 1.3 instead of TLS 1.2 because it allows for stronger ciphers.\nTo use TLS 1.3 on your clusters, you must enable\nOpenJSSE\nin the cluster’s Apache\nSpark configuration\n.\nAdd\nspark.driver.extraJavaOptions -XX:+UseOpenJSSE\nto your\nSpark Config\n.\nRestart your cluster.\nOpenJSSE and TLS 1.3 are now enabled on your cluster and can be used in notebooks." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/enable-retry-init-script.json b/scraped_kb_articles/enable-retry-init-script.json new file mode 100644 index 0000000000000000000000000000000000000000..f4ee9edfdc527bfb5673804e3d36449bb794c384 --- /dev/null +++ b/scraped_kb_articles/enable-retry-init-script.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/enable-retry-init-script", + "title": "Título do Artigo Desconhecido", + "content": "Init scripts are commonly used to configure Databricks clusters.\nThere are some scenarios where you may want to implement retries in an init script.\nExample init script\nThis sample init script shows you how to implement a retry for a basic copy operation.\nYou can use this sample code as a base for implementing retries in your own init script.\n%scala\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//retry-example-init.sh\", \"\"\"#!/bin/bash\r\n\r\necho \"starting script at `date`\"\r\n\r\nfunction fail {\r\n  echo $1 >&2\r\n  exit 1\r\n}\r\n\r\nfunction retry {\r\n  local n=1\r\n  local max=5\r\n  local delay=5\r\n  while true; do\r\n    \"$@\" && break || {\r\n      if [[ $n -lt $max ]]; then\r\n        ((n++))\r\n        echo \"Command failed. Attempt $n/$max: `date`\"\r\n        sleep $delay;\r\n      else\r\n        echo \"Collecting additional info for debugging..\"\r\n        ps aux > /tmp/ps_info.txt \r\n        debug_log_file=debug_logs_${HOSTNAME}_$(date +\"%Y-%m-%d--%H-%M\").zip\r\n        zip -r /tmp/${debug_log_file} /var/log/ /tmp/ps_info.txt /databricks/data/logs/\r\n        cp /tmp/${debug_log_file} /dbfs/tmp/\r\n        fail \"The command has failed after $n attempts. 
`date`\"\r\n      fi\r\n    }\r\n  done\r\n}\r\n\r\nsleep 15s\r\necho \"starting Copying at `date`\"\r\nretry cp -rv /dbfs/libraries/xyz.jar /databricks/jars/\r\n\r\necho \"Finished script at `date`\"\r\n\"\"\", true)" +} \ No newline at end of file diff --git a/scraped_kb_articles/enabling-dynamic-allocation-leads-to-nodes_lost-scenario.json b/scraped_kb_articles/enabling-dynamic-allocation-leads-to-nodes_lost-scenario.json new file mode 100644 index 0000000000000000000000000000000000000000..7836565f3d029e10b97883277ea5382528ed3317 --- /dev/null +++ b/scraped_kb_articles/enabling-dynamic-allocation-leads-to-nodes_lost-scenario.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/enabling-dynamic-allocation-leads-to-nodes_lost-scenario", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you enable dynamic allocation by setting\nspark.dynamicAllocation.enabled\nto\ntrue\n, you experience unexpected\nNODES_LOST\nscenarios.\nIn your event logs you see the following message.\nEVENT TYPE : NODES LOST\r\nMESSAGE : Compute lost at least one node. Reason: Communication lost\nAnd in your backend cluster logs you see the following error message.\nTerminateInstances worker_env_id: \"workerenv-XXXXXXXXXXX\"\r\ninstance_ids: \"i-XXXXXX\"\r\ninstance_termination_reason_code: LOST_EXECUTOR_DETECTED\nCause\nDynamic allocation is an Apache Spark feature exclusive to YARN and is not supported in Databricks environments. Databricks clusters are instead managed by Databricks Autoscaling.\nSolution\nEnable Autoscaling when you create a Databricks cluster. For more information, review the “Enable autoscaling” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/encountering-an-error-when-trying-to-connect-to-sybase-db-using-secure-port-10996.json b/scraped_kb_articles/encountering-an-error-when-trying-to-connect-to-sybase-db-using-secure-port-10996.json new file mode 100644 index 0000000000000000000000000000000000000000..b66be763654a584f78358fb49371225abea0e0ac --- /dev/null +++ b/scraped_kb_articles/encountering-an-error-when-trying-to-connect-to-sybase-db-using-secure-port-10996.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/encountering-an-error-when-trying-to-connect-to-sybase-db-using-secure-port-10996", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to connect to a Sybase DB using secure port 10996 from any Databricks workspaces, you receive the following error.\nPy4JJavaError: An error occurred while calling o422.load.\r\n: java.sql.SQLException: JZ00L: Login failed. Examine the SQLWarnings chained to this exception for the reason(s).\r\n  at com.sybase.jdbc4.jdbc.SybConnection.getAllExceptions(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybConnection.handleSQLE(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybConnection.a(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybConnection.handleHAFailover(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybConnection.(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybConnection.(Unknown Source)\r\n  at com.sybase.jdbc4.jdbc.SybDriver.connect(Unknown Source)\nCause\nThe JDBC connection string is missing the necessary SSL configuration parameters to establish a secure connection to the Sybase DB.\nSolution\nModify the JDBC connection string to include the necessary SSL configuration parameters.\nUpdate the JDBC\nserver_url\nvariable to include the\nENABLE_SSL=true\nand\nSSL_TRUST_ALL_CERTS=true\nparameters.\nUse the following format for the connection string as an example.\njdbc:sybase:Tds::/?ENABLE_SSL=true&SSL_TRUST_ALL_CERTS=true\nFor more 
information on enabling SSL connections, refer to the Sybase\nEnabling SSL connections\ndocumentation. You can also refer to the\nHow to connect to Sybase ASE using JDBC driver and SSL connection\narticle on Stack Overflow." +} \ No newline at end of file diff --git a/scraped_kb_articles/ensure-consistency-in-statistics-functions-between-spark-30-and-spark-31-and-above.json b/scraped_kb_articles/ensure-consistency-in-statistics-functions-between-spark-30-and-spark-31-and-above.json new file mode 100644 index 0000000000000000000000000000000000000000..61366732a032348e1ff568b973701ace76e49a2e --- /dev/null +++ b/scraped_kb_articles/ensure-consistency-in-statistics-functions-between-spark-30-and-spark-31-and-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/ensure-consistency-in-statistics-functions-between-spark-30-and-spark-31-and-above", + "title": "Unknown Article Title", + "content": "Problem\nThe statistics functions\ncovar_samp\n,\nkurtosis\n,\nskewness\n,\nstd\n,\nstddev\n,\nstddev_samp\n,\nvariance\n, and\nvar_samp\nreturn\nNaN\nwhen a divide by zero occurs during expression evaluation in Databricks Runtime 7.3 LTS.\nThe same functions return\nnull\nin Databricks Runtime 9.1 LTS and above, as well as in Databricks SQL endpoints, when a divide by zero occurs during expression evaluation.\nThis example image shows sample results when running on Databricks Runtime 7.3 LTS. In cases where divide by zero occurs, the result is returned as NaN.\nThis example image shows sample results when running on Databricks Runtime 9.1 LTS. 
In cases where divide by zero occurs, the result is returned as null.\nCause\nThe change in behavior is due to an underlying change in Apache Spark.\nIn Spark 3.0 and below, the default behavior returns NaN when a divide by zero occurs while evaluating a statistics function.\nIn Spark 3.1, this was changed to return null when a divide by zero occurs while evaluating a statistics function.\nFor more information on the change, please review Spark PR\n[SPARK-13860]\n.\nSolution\nSet\nspark.sql.legacy.statisticalAggregate\nto\nfalse\nin your\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n) on clusters running Databricks Runtime 7.3 LTS.\nThis returns null instead of NaN when a divide by zero occurs while evaluating a statistics function.\nInfo\nYou can also set this value at the notebook level using\nspark.conf.set(\"spark.sql.legacy.statisticalAggregate\", \"false\")\nif you don't have the ability to edit the cluster's\nSpark config\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/errno95-operation-not-supported.json b/scraped_kb_articles/errno95-operation-not-supported.json new file mode 100644 index 0000000000000000000000000000000000000000..3a38a15a826511c1efe5831e9166b5783717c93e --- /dev/null +++ b/scraped_kb_articles/errno95-operation-not-supported.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/errno95-operation-not-supported", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to append data to a file saved on an external storage mount point and are getting an error message:\nOSError: [Errno 95] Operation not supported\n.\nThe error occurs when trying to append to a file from both Python and R.\nCause\nDirect appends and random writes are not supported in FUSE v2, which is available in Databricks Runtime 6.0 and above. This is by design.\nThe underlying storage that is mounted to DBFS does not support append. 
This means that Databricks would have to download the data, run the append, and reupload the data in order to support the command. This works for small files, but quickly becomes an issue as file size increases. Because the DBFS mount is shared between driver and worker nodes, appending to a file from multiple nodes can cause data corruption.\nSolution\nAs a workaround, you should run your append on a local disk, such as\n/tmp\n, and move the entire file at the end of the operation.\nIf you need to perform cross-session appends, please contact your account team to discuss enabling an NFS mount on your clusters." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-cannot-find-catalog-plugin-class-for-catalog-when-using-a-custom-catalog-plugin-in-jdbc-driver.json b/scraped_kb_articles/error-cannot-find-catalog-plugin-class-for-catalog-when-using-a-custom-catalog-plugin-in-jdbc-driver.json new file mode 100644 index 0000000000000000000000000000000000000000..646c9523816c584efa4e1f9b25a7a8d41b7049a2 --- /dev/null +++ b/scraped_kb_articles/error-cannot-find-catalog-plugin-class-for-catalog-when-using-a-custom-catalog-plugin-in-jdbc-driver.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-cannot-find-catalog-plugin-class-for-catalog-when-using-a-custom-catalog-plugin-in-jdbc-driver", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re using a custom catalog plugin in Databricks. 
The plugin, packaged as a JAR file and uploaded to your workspace, is configured as a library on your cluster.\nWhen executing a query using the Databricks JDBC driver from an external application (for example, Java code), you encounter the following error.\nError:[_LEGACY_ERROR_TEMP_2215] org.apache.spark.SparkException: Cannot find catalog plugin class for catalog '': .\nHowever, you notice if the same query is first executed in a Databricks notebook on the same cluster, the catalog plugin initializes successfully, allowing subsequent queries on the JDBC driver to run without errors.\nCause\nThe custom catalog plugin is not being initialized when queries are executed using JDBC.\nSolution\nPreload the plugin to make it available for JDBC sessions. Use an init script to place the custom JAR file in the appropriate directories.\nThe following init script places the JAR file. It sets the target directories and then copies the JAR file into each of them.\n#!/bin/bash\r\n\r\n# Target directories\r\nTARGET_DIRECTORIES=(\"/databricks/jars\" \"/databricks/hive_metastore_jars\")\r\n\r\n# Copy the JAR file to the target directories\r\nfor TARGET_DIR in \"${TARGET_DIRECTORIES[@]}\"; do\r\n  cp \"\" \"$TARGET_DIR/\"\r\ndone" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-compressed-buffer-size-exceeds-2-gb-when-saving-data.json b/scraped_kb_articles/error-compressed-buffer-size-exceeds-2-gb-when-saving-data.json new file mode 100644 index 0000000000000000000000000000000000000000..63a01624e1ab19e6f84b68d137a251061c1428b0 --- /dev/null +++ b/scraped_kb_articles/error-compressed-buffer-size-exceeds-2-gb-when-saving-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/error-compressed-buffer-size-exceeds-2-gb-when-saving-data", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to save your data, your Apache Spark job fails with the following error.\nCaused by: java.io.IOException: Compressed buffer 
size exceeds 2147483647. The size of individual input values might be too large. Lower page/block row size checks to write data more often \r\nat org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:83)\r\nat org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)\nCause\nYou have individual records which exceed the 2 GB buffer size limit. The Parquet writer groups records together and checks the block size to determine when to close the row group. However, when a single record is too large, it can cause the buffer size to overflow.\nSolution\nNavigate to your cluster.\nClick\nAdvanced options\n.\nIn the\nSpark config\nbox under the\nSpark\ntab, add the following configuration settings to adjust the Parquet page and block sizes.\nspark.hadoop.parquet.page.size.row.check.max 1 spark.hadoop.parquet.block.size.row.check.max 1 spark.hadoop.parquet.page.size.row.check.min 1 spark.hadoop.parquet.block.size.row.check.min 1\nThese configurations increase the frequency of row group size checks in Parquet files. The default value for these configs is\n10\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-creating-tables-on-foreign-catalogs-in-databricks-lakehouse-federation.json b/scraped_kb_articles/error-creating-tables-on-foreign-catalogs-in-databricks-lakehouse-federation.json new file mode 100644 index 0000000000000000000000000000000000000000..d67df70f2dd9630338e078bf190afa71f8c8673a --- /dev/null +++ b/scraped_kb_articles/error-creating-tables-on-foreign-catalogs-in-databricks-lakehouse-federation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-creating-tables-on-foreign-catalogs-in-databricks-lakehouse-federation", + "title": "Unknown Article Title", + "content": "Problem\nYou have integrated an external data source (e.g. MySQL, Snowflake, PostgreSQL, etc.) with Databricks using Lakehouse Federation.
You can perform read-only queries on the foreign catalog from Unity Catalog, but when you try to create a table on a schema from the foreign catalog you get an error message saying the table cannot be created.\n[RequestId= ErrorClass=INVALID_PARAMETER_VALUE.CHILD_CREATION_FORBIDDEN_FOR_FOREIGN_SECURABLE] Securable `` of type TABLE cannot be created in parent `` of kind SCHEMA_FOREIGN_.\nCause\nLakehouse Federation only supports read operations for external data sources. Write operations, such as creating new schemas and tables, and modifying table contents, are not supported.\nSolution\nTo create a new table, create it directly in your external data source. After creating the new table, you can use Unity Catalog to access and read the data.\nFor more information, review the\nWhat is Lakehouse Federation?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-delta-unsupported-time-travel-multiple-formats-while-creating-view.json b/scraped_kb_articles/error-delta-unsupported-time-travel-multiple-formats-while-creating-view.json new file mode 100644 index 0000000000000000000000000000000000000000..8a7a16402bfadfadf7bd2702db3c768245dd060b --- /dev/null +++ b/scraped_kb_articles/error-delta-unsupported-time-travel-multiple-formats-while-creating-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-delta-unsupported-time-travel-multiple-formats-while-creating-view", + "title": "Unknown Article Title", + "content": "Problem\nWhen creating or querying a view using time travel syntax in a Unity Catalog (UC)-enabled cluster, you receive an error message.\nQuery\nCREATE OR REPLACE VIEW . 
AS\r\nSELECT * FROM .@v1234\nError\ncom.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_UNSUPPORTED_TIME_TRAVEL_MULTIPLE_FORMATS] Cannot specify time travel in multiple formats.\r\nat org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTimeTravel$$anonfun$apply$15.applyOrElse(Analyzer.scala:2042)\r\nat org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTimeTravel$$anonfun$apply$15.applyOrElse(Analyzer.scala:2040)\nCause\nDelta Lake allows you to reference specific table versions using time travel, but it requires a consistent format for specifying the version or timestamp. The view definition syntax from Hive does not carry over to UC-enabled clusters.\nSolution\nEnsure that you use a three-level namespace (catalog.schema.table) when creating views in UC.\nCheck the syntax for specifying time travel in your query and ensure that it is consistent.\nIf you need to use multiple formats for specifying time travel, simplify the query to use a single format.\nCREATE OR REPLACE VIEW .. 
AS\r\nSELECT * FROM ..@v1234" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-delta_clustering_show_create_table_without_clustering_columns-when-running-show-create-table-command.json b/scraped_kb_articles/error-delta_clustering_show_create_table_without_clustering_columns-when-running-show-create-table-command.json new file mode 100644 index 0000000000000000000000000000000000000000..653d05f495d272f857c842eb6e445416f05cf4f5 --- /dev/null +++ b/scraped_kb_articles/error-delta_clustering_show_create_table_without_clustering_columns-when-running-show-create-table-command.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-delta_clustering_show_create_table_without_clustering_columns-when-running-show-create-table-command", + "title": "Unknown Article Title", + "content": "Problem\nAfter executing the\nALTER TABLE {table} CLUSTER BY NONE\ncommand to remove the liquid clustering columns from the table, you then attempt to run\nSHOW CREATE TABLE\non your Delta table and receive the following error.\n[DELTA_CLUSTERING_SHOW_CREATE_TABLE_WITHOUT_CLUSTERING_COLUMNS] SHOW CREATE TABLE is not supported for Delta table with Liquid clustering without any clustering columns. 
SQLSTATE: 0A000\nCause\nThe\nSHOW CREATE TABLE\ncommand is not supported for tables that have had their clustering columns removed using the\nALTER\ncommand.\nSolution\nUse Databricks Runtime 15 or above, where the\nSHOW CREATE TABLE\ncommand is supported.\nIf you do not want to update to Databricks Runtime 15 or above, you can drop the liquid clustering table feature.\nStart an all-purpose compute cluster running Databricks Runtime 14.1 or above.\nAttach a notebook to this cluster.\nRun the following\nALTER TABLE\ncommands:\nALTER TABLE DROP FEATURE liquid\r\nALTER TABLE DROP FEATURE clustering\nShut down the cluster and switch back to your existing compute clusters.\nNote\nThe\nALTER TABLE\ncommands to disable liquid clustering may show an error saying the feature is not present. This error can be safely ignored while downgrading." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-delta_merge_incompatible_decimal_type-when-performing-a-merge-operation-in-databricks-runtime-143-lts-or-above.json b/scraped_kb_articles/error-delta_merge_incompatible_decimal_type-when-performing-a-merge-operation-in-databricks-runtime-143-lts-or-above.json new file mode 100644 index 0000000000000000000000000000000000000000..ac06855e00b86b3bf9e8d29484eb32e7efc420e5 --- /dev/null +++ b/scraped_kb_articles/error-delta_merge_incompatible_decimal_type-when-performing-a-merge-operation-in-databricks-runtime-143-lts-or-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/error-delta_merge_incompatible_decimal_type-when-performing-a-merge-operation-in-databricks-runtime-143-lts-or-above", + "title": "Unknown Article Title", + "content": "Problem\nWhile performing a merge operation in Databricks Runtime 14.3 LTS or above, you encounter an error related to incompatible decimal scales.\nCaused by: com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_MERGE_INCOMPATIBLE_DECIMAL_TYPE] Failed to merge decimal types with 
incompatible scale 6 and 5 (or any other scale) when attempting to merge fields with decimal types in Databricks.\nAdditionally, you may see the following error message.\n[DELTA_FAILED_TO_MERGE_FIELDS] Failed to merge fields '' and '' SQLSTATE: 22005\r\nCaused by: com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_MERGE_INCOMPATIBLE_DECIMAL_TYPE] Failed to merge decimal types with incompatible scale 6 and 5.\nCause\nYou have mismatched decimal scales between the merging fields.\nContext\nA change in the behavior of decimal type casting has been introduced as of Databricks Runtime 14.3 LTS.\nIn earlier Databricks Runtime versions, a bug in Apache Spark caused automatic casting of decimal values during operations, which could result in unintentional loss of precision. Databricks Runtime 14.3 LTS includes a fix which enforces stricter type compatibility for decimal scales.\nSolution\nIf you require compatibility with earlier Databricks Runtime behavior, set the following configuration property using a notebook. This configuration reverts the runtime to the legacy behavior, ensuring that decimal values are cast in a manner compatible with earlier versions.\nspark.conf.set(\"spark.sql.legacy.decimal.retainFractionDigitsOnTruncate\", True)\nFor greater control and accuracy, explicitly cast the columns to a specific decimal type using the\ncast\nfunction. 
This ensures that the column is explicitly defined with the correct precision and scale, avoiding compatibility issues during operations.\nExample\nfrom pyspark.sql.functions import col\r\ndf = df.withColumn(\"\", col(\"\").cast(\"decimal()\"))" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-delta_unsupported_features_for_read-when-accessing-a-delta-table.json b/scraped_kb_articles/error-delta_unsupported_features_for_read-when-accessing-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..ae1875381403de3f2938506ea36b3ecacc782aee --- /dev/null +++ b/scraped_kb_articles/error-delta_unsupported_features_for_read-when-accessing-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-delta_unsupported_features_for_read-when-accessing-a-delta-table", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to access a Delta table, you receive the following error message.\n[DELTA_UNSUPPORTED_FEATURES_FOR_READ] Unsupported Delta read feature: table \"..\" requires reader table feature(s) that are unsupported by this version of Databricks: variantType-preview. Please refer to https://docs.databricks.com/delta/feature-compatibility.html for more information on Delta lake feature compatibility. SQLSTATE: 56038\nCause\nThe Delta feature\nVARIANT\nis enabled for this table. You can verify this by navigating to the table’s\nDetails\ntab within the Catalog Explorer. In the\nProperties\nsection, look for the entry:\n‘\"feature.variantType-preview\": \"supported\"’\nThis entry indicates the\nVARIANT\nfeature is enabled.\nSolution\nEnsure you are using Databricks Runtime 15.3 or above. For further information, please refer to the\nVariant support in Delta Lake\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-download-full-results.json b/scraped_kb_articles/error-download-full-results.json new file mode 100644 index 0000000000000000000000000000000000000000..8d22ad17faae940a414f92dcbdb87a709b652a99 --- /dev/null +++ b/scraped_kb_articles/error-download-full-results.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/error-download-full-results", + "title": "Unknown Article Title", + "content": "Problem\nYou are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to\nDownload full results\nyou get an error.\nError in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/\nReproduce error\nCreate two tables.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\n\r\ndf = spark.range(12000)\r\ndf = df.withColumn(\"col2\",lit(\"test\"))\r\ndf.createOrReplaceTempView(\"table1\")\r\n\r\ndf1 = spark.range(5)\r\ndf1.createOrReplaceTempView(\"table2\")\nPerform a left outer join on the tables.\n%sql\r\n\r\nselect * from table1 t1 left join table2 t2 on t1.id = t2.id\nClick\nDownload preview\n. 
A CSV file downloads.\nClick\nDownload full results\n. An error is generated.\nCause\nDownload preview\nworks because this is a frontend-only operation that runs in the browser. No constraints are checked and only 1000 rows are included in the CSV file.\nDownload full results\nre-executes the query in Apache Spark and writes the CSV file internally. The error occurs when duplicate columns are found after a join operation.\nSolution\nOption 1\nIf you select all the required columns, and avoid duplicate columns after the join operation, you will not get the error and can download the full result.\n%sql\r\n\r\nselect t1.id, t1.col2 from table1 t1 left join table2 t2 on t1.id = t2.id\nOption 2\nYou can use DataFrames to prevent duplicated columns. If there are no duplicated columns after the join operation, you will not get the error and can download the full result.\n%python\r\n\r\nresult_df = df.join(df1, [\"id\"],\"left\")\r\ndisplay(result_df)" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-during-maven-library-installation-error_maven_library_resolution.json b/scraped_kb_articles/error-during-maven-library-installation-error_maven_library_resolution.json new file mode 100644 index 0000000000000000000000000000000000000000..ab4d237e8aa7bd4a2f1e38e11b12103eaf8058bde --- /dev/null +++ b/scraped_kb_articles/error-during-maven-library-installation-error_maven_library_resolution.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/error-during-maven-library-installation-error_maven_library_resolution", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to install Maven libraries, you receive an error message.\nClusterLibrariesModel.scala:3569 : [30 occurrences] Error code: ERROR_MAVEN_LIBRARY_RESOLUTION. Library hash: XXXXXX, library type: maven, cluster info: {creator: Webapp, access mode: , Spark version: {DBR version}}. 
Library installation attempted on the driver node of cluster [ClusterID] org [WorkspaceID] and failed due to library resolution error.. Error message: Library resolution failed because unresolved dependency: net.minidev:json-smart:[1.3.1,2.3]: not found\nCause\nA recent release of the Maven package\njson-smart-v2\nhas led to a significant disruption in dependency resolution. The central metadata for this package has been corrupted, resulting in the removal of all versions prior to 2.5.2. This results in an increase in Maven library resolution failures if some previous version of json-smart is required as a transitive dependency.\nFor details, refer to the original GitHub issue\n2.5.2 Release Breaking Upstream Dependencies #240\n.\nThis issue also affects Google's Maven mirror, which is used by Databricks Runtime versions to resolve Maven libraries. Maven Central is used as the backup for Google's Maven mirror.\nSolution\nDatabricks recommends installing your library separately from the json-smart library. You can use the UI or an API call.\nInstalling libraries using the UI\nInstall your library excluding the json-smart library. 
Indicate the exclusion in the\nExclusions\ntext box in the\nInstall library\nmodal.\nIn the same\nInstall library\nmodal, indicate the version of\nnet.minidev:json-smart\nyou require, for example\n2.3\n.\nInstalling libraries using an API call\nTo add a library to your cluster or job creation, add the following\n“libraries”\npayload to the\n/api/2.0/libraries/install\nAPI call.\n\"libraries\": [\r\n        {\r\n             \"maven\": {\r\n                \"coordinates\": \"net.minidev:json-smart:\"\r\n             }\r\n         },\r\n         {\r\n            \"maven\": {\r\n                \"coordinates\": \"\",\r\n                \"exclusions\": [\r\n                    \"net.minidev:json-smart:RELEASE\"\r\n                ]\r\n            }\r\n         }\r\n ]\nHost a private Maven mirror\nIf you host a private Maven mirror, you can set the following Apache Spark configuration in your cluster settings.\nspark.databricks.driver.preferredMavenCentralMirrorUrl \nImplement a global init script\nImplement a global init script that downloads all the needed JARs using\nmvn dependency:copy-dependencies\nfrom Maven, or from your storage location, and moves them to\n/databricks/jars\n.\napt update && apt install -y maven\r\ncat > pom.xml << 'EOF'\r\n<project>\r\n    <modelVersion>4.0.0</modelVersion>\r\n    <groupId>temp</groupId>\r\n    <artifactId>temp</artifactId>\r\n    <version>1.0</version>\r\n    <dependencies>\r\n        <dependency>\r\n            <groupId>com.microsoft.azure</groupId>\r\n            <artifactId>azure-eventhubs-spark_2.12</artifactId>\r\n            <version></version>\r\n            <exclusions>\r\n                <exclusion>\r\n                    <groupId>net.minidev</groupId>\r\n                    <artifactId>json-smart</artifactId>\r\n                </exclusion>\r\n            </exclusions>\r\n        </dependency>\r\n    </dependencies>\r\n</project>\r\nEOF\r\nmkdir -p /tmp/jars \r\nmvn -f pom.xml dependency:copy-dependencies -DoutputDirectory=/tmp/jars\r\ncp /tmp/jars/* /databricks/jars\nIf none of the above options are available, as a last resort you can use another Maven mirror for the repository. Note that Databricks cannot guarantee the quality or safety of the mirror. To select a mirror, refer to the\nmirror repository\n." 
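The `/api/2.0/libraries/install` payload described above can be sketched in Python before sending it to the REST API. The cluster ID, the json-smart version, and the dependent library coordinates below are hypothetical placeholders, not values from the article:

```python
import json

# Build the /api/2.0/libraries/install request body described above.
# All concrete values here are illustrative placeholders.
payload = {
    "cluster_id": "1234-567890-abcde123",
    "libraries": [
        # Pin the json-smart version you need first...
        {"maven": {"coordinates": "net.minidev:json-smart:2.3"}},
        # ...then install your library with json-smart excluded.
        {
            "maven": {
                "coordinates": "com.example:your-library:1.0.0",
                "exclusions": ["net.minidev:json-smart:RELEASE"],
            }
        },
    ],
}

body = json.dumps(payload)
# POST `body` to <workspace-url>/api/2.0/libraries/install with a
# bearer token, e.g. via the requests library or the Databricks CLI.
```

Separating the pinned json-smart entry from the excluding entry mirrors the two-step UI flow above: the exclusion stops transitive resolution, and the pinned coordinate supplies the version you actually want.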
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-embedding-aibi-dashboard-in-google-sites.json b/scraped_kb_articles/error-embedding-aibi-dashboard-in-google-sites.json new file mode 100644 index 0000000000000000000000000000000000000000..94fa3471f1695c737e0c367898e0a029e233dbe1 --- /dev/null +++ b/scraped_kb_articles/error-embedding-aibi-dashboard-in-google-sites.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/error-embedding-aibi-dashboard-in-google-sites", + "title": "Unknown Article Title", + "content": "Problem\nYou are embedding AI/BI dashboards on Google Sites and are restricting access to specific domains. You have whitelisted\nsites.google.com\n, but your users encounter an\nEmbedding dashboards is not available on this domain\nerror message.\nCause\nDue to Google’s content delivery design and the way Databricks embeds AI/BI dashboards, multiple domains owned and used by Google are required to properly serve embedded content. Specifically,\nhttps://www.gstatic.com\nand\n*.googleusercontent.com\nmay be required in addition to\nsites.google.com\n. If the required domains are not whitelisted, embedding fails.\nSolution\nFollow the\nManage dashboard embedding\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation and add the required domains based on your embedding method:\nFor\nBy URL\nembedding, add\nsites.google.com\nand\nhttps://www.gstatic.com\nto the list of allowed domains.\nFor\nEmbed Code\nembedding, add\nsites.google.com\n,\nhttps://www.gstatic.com\n, and\n*.googleusercontent.com\nto the list of allowed domains." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-filenotfoundexception-while-streaming-job-or-reading-delta-table-even-with-ignoremissingfiles-set-.json b/scraped_kb_articles/error-filenotfoundexception-while-streaming-job-or-reading-delta-table-even-with-ignoremissingfiles-set-.json new file mode 100644 index 0000000000000000000000000000000000000000..1114ae6873dec710d63628ed1b402e03059430c6 --- /dev/null +++ b/scraped_kb_articles/error-filenotfoundexception-while-streaming-job-or-reading-delta-table-even-with-ignoremissingfiles-set-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-filenotfoundexception-while-streaming-job-or-reading-delta-table-even-with-ignoremissingfiles-set-", + "title": "Unknown Article Title", + "content": "Problem\nWhile running a streaming job or reading a Delta table, you receive an error even though you have set\nignoremissingfiles\n.\nJob aborted due to stage failure: Error while reading file dbfs:/mnt/...snappy.parquet.\r\nCaused by: IOException: java.io.FileNotFoundException: Operation failed: 'The specified path does not exist.', 404, GET\nCause\nThere is a discrepancy between the metadata and the data files.\nIt is also possible that your Delta logs have stale metadata entries that reference files no longer in the storage location.\nSolution\nUse the\nFSCK REPAIR TABLE\ncommand to synchronize the metadata with the data files. This command removes metadata entries for files that are not present in the underlying file system. Execute the following command in your Databricks notebook.\nFSCK REPAIR TABLE delta.``\nFurther, ensure all files referenced in the Delta logs are present in the storage location. Manually check the storage directory or use automated scripts to verify file existence." 
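A minimal sketch of the automated file-existence check mentioned in that solution, assuming you have already extracted the referenced file paths from the Delta log; the local `os.path` check stands in for a `dbutils.fs` or cloud-storage SDK call on object storage:

```python
import os

def find_missing_files(referenced_paths):
    # Return the paths referenced in the Delta log that no longer exist
    # on storage. os.path.exists covers local or FUSE-mounted paths;
    # substitute a dbutils.fs or cloud SDK listing for object storage.
    return [p for p in referenced_paths if not os.path.exists(p)]

# Hypothetical example paths; any hits would warrant FSCK REPAIR TABLE.
missing = find_missing_files(["/mnt/data/part-00000.snappy.parquet"])
```

Running this before `FSCK REPAIR TABLE` tells you how many stale entries the repair will remove.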
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-invalid_temp_obj_reference-when-trying-to-create-a-view.json b/scraped_kb_articles/error-invalid_temp_obj_reference-when-trying-to-create-a-view.json new file mode 100644 index 0000000000000000000000000000000000000000..a3f37c0ce0b169445dd8a6beea48004aa4b6e804 --- /dev/null +++ b/scraped_kb_articles/error-invalid_temp_obj_reference-when-trying-to-create-a-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/error-invalid_temp_obj_reference-when-trying-to-create-a-view", + "title": "Unknown Article Title", + "content": "Problem\nWhen creating a\nVIEW\nusing a temporary object as a source, you encounter an\nINVALID_TEMP_OBJ_REFERENCE\nerror.\nExample error message\n[INVALID_TEMP_OBJ_REFERENCE] Cannot create the persistent object `main`.`default`.`table` of the type VIEW because it references to the temporary object `thisIsView` of the type VIEW. Please make the temporary object `thisIsView` persistent, or make the persistent object `main`.`default`.`table` temporary. SQLSTATE: 42K0F\nCause\nTemporary objects are limited to a session and are not persistent.\nSolution\nFirst, persist the temporary object to a location, such as a materialized view or a new table. Then create your\nVIEW\n.\nNote\nTemporary views have limited scope and persistence, and aren't registered to a schema or catalog.\nFor more information, please refer to the\nWhat is a view?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
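A sketch of that persist-then-create flow in SQL; `main.default` and `thisIsView` are taken from the example error message, while `persisted_source` and `my_view` are hypothetical names:

```sql
-- Persist the temporary object as a real table first.
CREATE OR REPLACE TABLE main.default.persisted_source AS
SELECT * FROM thisIsView;

-- The persistent view can now reference the persisted table.
CREATE OR REPLACE VIEW main.default.my_view AS
SELECT * FROM main.default.persisted_source;
```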
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-javaiofilenotfoundexception-when-job-attempts-to-read-or-write-intermediary-files.json b/scraped_kb_articles/error-javaiofilenotfoundexception-when-job-attempts-to-read-or-write-intermediary-files.json new file mode 100644 index 0000000000000000000000000000000000000000..9bd13e46c1d96426565ceecec06a56d672d85244 --- /dev/null +++ b/scraped_kb_articles/error-javaiofilenotfoundexception-when-job-attempts-to-read-or-write-intermediary-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/error-javaiofilenotfoundexception-when-job-attempts-to-read-or-write-intermediary-files", + "title": "Unknown Article Title", + "content": "Problem\nWhen your job attempts to read or write intermediary files, such as Excel files, in the Databricks File System (DBFS), you encounter a\njava.io.FileNotFoundException\nerror.\nThe error message may include paths with the workspace ID, such as\n/00000000000000///.xlsx\n.\nCause\nThe job is attempting to access a file that has been deleted or is in the process of being deleted.\nIncorrect or expired storage account authentication can also cause a similar file access issue. In such cases, the error message includes an unauthorized error.\nSolution\nCache the DataFrame before performing write operations. This ensures that the data is fully loaded into memory before any file operations are attempted. Add the following line before the write operation.\ndf.cache().show()\nAlso, ensure that you are using a version of the\ncom.crealytics.spark.excel\nlibrary compatible with your Databricks Runtime version. 
For Databricks Runtime 13.3, use the following Maven coordinate.\ncom.crealytics:spark-excel_2.12:3.4.1_0.20.4\nFor the Maven coordinates matching other Databricks Runtime versions, refer to Maven Central’s\ncom.crealytics\ndocumentation.\nNote\nIf possible, use CSV format for intermediary storage instead of Excel because Apache Spark has native support for CSV files." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-javalangunsupportedoperationexception-when-trying-to-read-datetime-data-files.json b/scraped_kb_articles/error-javalangunsupportedoperationexception-when-trying-to-read-datetime-data-files.json new file mode 100644 index 0000000000000000000000000000000000000000..d8083b872ae752ea173976af516bcd90d8ca7368 --- /dev/null +++ b/scraped_kb_articles/error-javalangunsupportedoperationexception-when-trying-to-read-datetime-data-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/error-javalangunsupportedoperationexception-when-trying-to-read-datetime-data-files", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to read datetime data files, you encounter an error.\njava.lang.UnsupportedOperationException with the message: \"LEGACY datetime rebase mode is only supported for files written in UTC timezone. Actual file timezone: Asia/Kolkata.\" This error occurs when attempting to read data files that were written in a timezone other than UTC while using the LEGACY datetime rebase mode.\nCause\nThe data files are written in a time zone other than UTC. The\nLEGACY\ndatetime rebase mode in Apache Spark is designed to handle datetime values based on the UTC timezone. 
When files are written in a different timezone, such as Asia/Kolkata, the rebase mode cannot correctly interpret the datetime values, leading to the\nUnsupportedOperationException\n.\nSolution\nConfigure your Spark cluster’s datetime rebase mode.\nThe\nspark.sql.legacy.parquet.datetimeRebaseModeInRead\nconfiguration allows Spark to read the datetime values in the legacy rebase mode, even if the files were written in a timezone other than UTC.\nNavigate to the cluster configuration page in your Databricks workspace.\nClick the\nAdvanced Options\ntoggle.\nAdd the following configuration in the\nSpark configuration\ntab.\nspark.sql.legacy.parquet.datetimeRebaseModeInRead LEGACY\nFor more information, review the Spark\nParquet Files\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-job-aborted-due-to-stage-failure-when-trying-to-shallow-clone-a-large-table-with-significant-deltalog-entries.json b/scraped_kb_articles/error-job-aborted-due-to-stage-failure-when-trying-to-shallow-clone-a-large-table-with-significant-deltalog-entries.json new file mode 100644 index 0000000000000000000000000000000000000000..b144365481d5c1cd852a9f843510077ac43d56db --- /dev/null +++ b/scraped_kb_articles/error-job-aborted-due-to-stage-failure-when-trying-to-shallow-clone-a-large-table-with-significant-deltalog-entries.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-job-aborted-due-to-stage-failure-when-trying-to-shallow-clone-a-large-table-with-significant-deltalog-entries", + "title": "Unknown Article Title", + "content": "Problem\nYou attempt to shallow clone a large table with a significant number of DeltaLog entries using the following command.\nCREATE TABLE SHALLOW CLONE \nYou receive the following error message.\n\"Job aborted due to stage failure: Total size of serialized results of tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.\"\nCause\nThe cloning process needs to collect entries on the 
driver before committing, which can overflow driver memory when the metadata is materialized.\nThis happens when the\nSHALLOW CLONE\noperation involves an excessive number of\nAddFile\nentries.\nSolution\nSet\nspark.driver.maxResultSize\nabove\n4 GB\n. Start with\n6 GB\n, and if needed, increase further. Consider increasing driver instance size as well.\nFor more information, refer to the\nSpark Configuration\ndocumentation.\nPreventive measures\nMonitor the size of the metadata and transaction log, and the number of files in the tables you are cloning.\nConsider optimizing the tables themselves, such as by vacuuming and optimizing, to reduce the scale of the shallow clone task.\nFor more information, review the\nDiving Into Delta Lake: Unpacking The Transaction Log\nDatabricks blog post." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-jvm_attribute_not_supported-when-trying-to-obtain-the-number-of-partitions-in-a-dataframe.json b/scraped_kb_articles/error-jvm_attribute_not_supported-when-trying-to-obtain-the-number-of-partitions-in-a-dataframe.json new file mode 100644 index 0000000000000000000000000000000000000000..f62beb4d79a0c743362d0fc4d9ece35a3b8cf2de --- /dev/null +++ b/scraped_kb_articles/error-jvm_attribute_not_supported-when-trying-to-obtain-the-number-of-partitions-in-a-dataframe.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-jvm_attribute_not_supported-when-trying-to-obtain-the-number-of-partitions-in-a-dataframe", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to use the\ndf.rdd.getNumPartitions\nmethod to obtain the number of partitions in a DataFrame, the command fails with the following error.\n[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `rdd` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. 
Visit\nhttps://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession\nfor creating regular Spark Session in detail.\nCause\nIn Databricks Runtime 14.0 and above, shared clusters use the Apache Spark Connect architecture. RDD APIs are not supported in shared access modes.\nSolution\nSwitch to a single-user cluster.\nIf switching to a single-user cluster is not feasible, you can use the\nspark_partition_id()\nfunction, which is available in shared clusters.\nThe following code snippet retrieves the distinct\nspark_partition_id()\nvalues and counts them to determine the number of partitions in the DataFrame.\n%python\r\nfrom pyspark.sql import functions as F\r\ndf.select(F.spark_partition_id()).distinct().count()\nFor more information, review the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-listdatabases-method-is-not-allowlisted-when-running-catalogimpllistdatabases-on-a-no-isolation-shared-cluster.json b/scraped_kb_articles/error-listdatabases-method-is-not-allowlisted-when-running-catalogimpllistdatabases-on-a-no-isolation-shared-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..726f37aa1531b0d47e6f18e61ba216374e926512 --- /dev/null +++ b/scraped_kb_articles/error-listdatabases-method-is-not-allowlisted-when-running-catalogimpllistdatabases-on-a-no-isolation-shared-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-listdatabases-method-is-not-allowlisted-when-running-catalogimpllistdatabases-on-a-no-isolation-shared-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen you attempt to run the\nspark.catalog.listDatabases()\ncommand in a no isolation shared (formerly high concurrency) cluster, you receive a\npy4j.security.Py4JSecurityException\nerror, indicating that the\nlistDatabases\nmethod in the\nCatalogImpl\nclass 
is not allowlisted.\nCause\nNo isolation shared (formerly high concurrency) clusters have stricter security settings. The\nlistDatabases\nmethod is not allowlisted by default in these clusters, leading to the observed error.\nSolution\nUse Databricks Runtime 14.1 or above to use\nspark.catalog\nfunctions in shared clusters for Python and Scala.\nIf you need to work with a Databricks Runtime below 14.1, switch to a dedicated (formerly single user) cluster.\nPreventative measures\nRegularly review and update your cluster configurations to ensure they meet your needs and comply with Databricks security recommendations.\nStay up-to-date with the latest Databricks runtime versions and releases to take advantage of new features, bug fixes, and security improvements. For details, review the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-message-when-trying-to-send-a-json-input-data-request-to-a-model-endpoint.json b/scraped_kb_articles/error-message-when-trying-to-send-a-json-input-data-request-to-a-model-endpoint.json new file mode 100644 index 0000000000000000000000000000000000000000..7271f72f61c76b75d4e97562c53b8d675618839a --- /dev/null +++ b/scraped_kb_articles/error-message-when-trying-to-send-a-json-input-data-request-to-a-model-endpoint.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/error-message-when-trying-to-send-a-json-input-data-request-to-a-model-endpoint", + "title": "Unknown Article Title", + "content": "Problem\nAfter loading your model using the Pyfunc interface in a notebook and passing it in as a Pandas DataFrame, you then try to send a JSON input data request to the model’s endpoint. In response you receive an error message.\n{\"error_code\": \"BAD_REQUEST\", \"message\": \"Encountered an unexpected error while evaluating the model. 
Verify that the input is compatible with the model for inference. Error ''<' not supported between instances of 'NoneType' and 'NoneType'\"}.\nThe issue occurs when the model is a feature_store flavor with a scikit-learn sub-flavor.\nCause\nThere is a discrepancy between the endpoint and the loaded model in a notebook to get predictions, where the feature column is\nNULL\nfor all rows. This causes the entire column to be\nNone\n.\nSolution\nPass one more input parameter from the list of input columns in the endpoint’s model signature. This adjustment allows the endpoint to process the input correctly." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-number-of-currently-active-jobs-exceeds-hard-limit-of-spark-databricks-maxactivejobs-when-trying-to-run-an-api-request.json b/scraped_kb_articles/error-number-of-currently-active-jobs-exceeds-hard-limit-of-spark-databricks-maxactivejobs-when-trying-to-run-an-api-request.json new file mode 100644 index 0000000000000000000000000000000000000000..d278561f0c11d5ba5188ca64c5aaf5aee67ab69d --- /dev/null +++ b/scraped_kb_articles/error-number-of-currently-active-jobs-exceeds-hard-limit-of-spark-databricks-maxactivejobs-when-trying-to-run-an-api-request.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/error-number-of-currently-active-jobs-exceeds-hard-limit-of-spark-databricks-maxactivejobs-when-trying-to-run-an-api-request", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to run a notebook or job API request, you encounter the following error.\nERROR Uncaught throwable from user code: org.apache.spark.SparkException: The number of currently active jobs (XXXX) exceeds the hard limit of spark.databricks.maxActiveJobs=2000.\nCause\nThe number of active jobs (\n$numActive\n) in your Apache Spark application has exceeded the limit set by the\nspark.databricks.maxActiveJobs\nparameter, which is set to\n2000\n.\nSolution\nIn your cluster settings, raise the 
configured limit from the default\n2000\nto N, depending on your needs.\nImportant\nThe\nspark.databricks.maxActiveJobs\nsetting is used to limit the number of concurrent active jobs to prevent resource contention and ensure system stability. Use caution when increasing this limit, and consider your available physical resources (CPU, memory, and so on) to avoid potential performance degradation.\nRaising this limit can indirectly lead to increased costs if it results in higher resource usage or degraded performance. Before implementing the solution, consider whether the job structure itself is generating too many concurrent jobs due to inefficient logic. If possible, optimize job design first to reduce the need to raise this limit.\nThe\nspark.databricks.maxActiveJobs\nconfiguration must be set at the cluster level and cannot be set programmatically within a job.\n1. In the Databricks UI, navigate to the\nCompute\nmenu option in the vertical menu on the left.\n2. Select the cluster you are using.\n3. In the\nCluster Configuration\ntab, click the\nEdit\nbutton in the top right.\n4. Scroll down to the\nAdvanced Options\nsection and click to expand.\n5. Enter the configuration\nspark.databricks.maxActiveJobs\nwith the desired value in the\nSpark Config\nfield.\nspark.databricks.maxActiveJobs \n6. Save the changes and restart the cluster for the new configuration to take effect." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-permission_denied-when-running-notebooks-with-a-group-assigned-dedicated-compute.json b/scraped_kb_articles/error-permission_denied-when-running-notebooks-with-a-group-assigned-dedicated-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..abfe4d51862b0788a9ab0f3f6cb2754ca0b2cfbf --- /dev/null +++ b/scraped_kb_articles/error-permission_denied-when-running-notebooks-with-a-group-assigned-dedicated-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/error-permission_denied-when-running-notebooks-with-a-group-assigned-dedicated-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to run a notebook or job using a dedicated compute assigned to a user group. The group has been granted\nCAN RUN\npermissions on the workspace folder that contains the notebook. However, when executing the notebook, you encounter an error similar to the following.\nPy4JJavaError: An error occurred while calling o480.run.\r\n: com.databricks.WorkflowException: com.databricks.common.client.DatabricksServiceHttpClientException: PERMISSION_DENIED: The user '' does not have access to Databricks Workspace\nCause\nThe user group does not have the workspace access entitlement enabled. Even though the group has folder-level permissions, this entitlement is required for users to access and interact with objects in the workspace, including notebooks and jobs.\nSolution\nHave a workspace administrator grant workspace access entitlement to the group assigned to the compute. 
They should follow these steps.\nIn the Databricks UI, click\nSettings\n.\nSelect the\nIdentity and access\ntab.\nClick\nGroups\n.\nSearch for the relevant user group named in the error, and click its name.\nIn the group settings, locate the\nEntitlements\nsection.\nCheck if\nWorkspace access\nis listed.\nIf not enabled, click\nGrant\nnext to\nWorkspace access\n.\nFor more information, review the\nManage entitlements\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor information about how to manage the groups using the account console, refer to the\nManage groups\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-pysparknotimplementederror-when-using-an-rdd-to-extract-distinct-values-on-a-standard-cluster.json b/scraped_kb_articles/error-pysparknotimplementederror-when-using-an-rdd-to-extract-distinct-values-on-a-standard-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..81641a65208b4b7d65e12407894950df98b8db0d --- /dev/null +++ b/scraped_kb_articles/error-pysparknotimplementederror-when-using-an-rdd-to-extract-distinct-values-on-a-standard-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/error-pysparknotimplementederror-when-using-an-rdd-to-extract-distinct-values-on-a-standard-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to extract distinct values from a PySpark DataFrame using a resilient distributed dataset (RDD) on a standard cluster, you receive a PySpark not implemented error.\nExample code\ndf.select(\"column_1\").distinct().rdd.flatMap(lambda x: x).collect()\nExample error message\nPySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented.\nCause\nApache Spark RDD APIs are not supported in standard (formerly shared) access mode clusters.\nFor more information, refer to the Spark API limitations and requirements for Unity Catalog standard access mode section of the\nCompute 
access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nIn your standard access mode cluster, use the\n.collect()\nmethod directly on the distinct DataFrame. This retrieves all the distinct rows in a list.\nThen apply a list comprehension to extract the\ncolumn_1\nvalues from the row objects.\nrows = df.select(\"column_1\").distinct().collect()\r\ndistinct_column_1 = [row.column_1 for row in rows]" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-run-msck-repair-table-parallel.json b/scraped_kb_articles/error-run-msck-repair-table-parallel.json new file mode 100644 index 0000000000000000000000000000000000000000..5b54cdb0810a8e472ce17e0999e8360940a53148 --- /dev/null +++ b/scraped_kb_articles/error-run-msck-repair-table-parallel.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/error-run-msck-repair-table-parallel", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to run\nMSCK REPAIR TABLE \ncommands for the same table in parallel and are getting\njava.net.SocketTimeoutException: Read timed out\nor out of memory error messages.\nCause\nWhen you try to add a large number of new partitions to a table with\nMSCK REPAIR\nin parallel, the Hive metastore becomes a limiting factor, as it can only add a few partitions per second. The greater the number of new partitions, the more likely that a query will fail with a\njava.net.SocketTimeoutException: Read timed out\nerror or an out of memory error message.\nSolution\nYou should not attempt to run multiple\nMSCK REPAIR TABLE \ncommands in parallel.\nDatabricks uses multiple threads for a single\nMSCK REPAIR\nby default, which splits\ncreatePartitions()\ninto batches. By limiting the number of partitions created, it prevents the Hive metastore from timing out or hitting an out of memory error. 
It also gathers the fast stats (number of files and the total size of files) in parallel, which avoids the bottleneck of listing the metastore files sequentially. This is controlled by\nspark.sql.gatherFastStats\n, which is enabled by default." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-running-parameterized-sql-queries-in-databricks-connect-with-vs-code.json b/scraped_kb_articles/error-running-parameterized-sql-queries-in-databricks-connect-with-vs-code.json new file mode 100644 index 0000000000000000000000000000000000000000..838a65f610fa25fb19ce08271d9b73e9cb027369 --- /dev/null +++ b/scraped_kb_articles/error-running-parameterized-sql-queries-in-databricks-connect-with-vs-code.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/error-running-parameterized-sql-queries-in-databricks-connect-with-vs-code", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen running parameterized SQL queries from an interactive session in Databricks Connect with Visual Studio Code (VS Code), you receive an error message despite running the same queries successfully from a notebook.\nExample\nspark.sql(\"SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}\", bound1=7, bound2=9).show()\nTypeError: SparkSession.sql() got an unexpected keyword argument ''\nCause\nThe\nSparkSession.sql()\nmethod in Databricks Connect does not support passing parameters directly as keyword arguments.\nSolution\nUse string interpolation to pass parameters into the SQL query.\nDefine the parameters separately.\nbound1 = 7\r\nbound2 = 9\nUse an f-string to interpolate the parameters into the SQL query.\nspark.sql(f\"SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}\").show()" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-trying-to-access-a-unity-catalog-uc-volume-path-from-a-non-uc-compute.json b/scraped_kb_articles/error-trying-to-access-a-unity-catalog-uc-volume-path-from-a-non-uc-compute.json 
new file mode 100644 index 0000000000000000000000000000000000000000..54e1cc512fb8c9144227fc4084b1b26f6db87d52 --- /dev/null +++ b/scraped_kb_articles/error-trying-to-access-a-unity-catalog-uc-volume-path-from-a-non-uc-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-trying-to-access-a-unity-catalog-uc-volume-path-from-a-non-uc-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to access a Unity Catalog (UC) volume path from a non-UC compute, you receive the following error.\ncom.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /Volumes/{path-to-your-volume} for creating file system within mount at '/Volumes/{path-to-your-volume}'.\nUpon reviewing the driver logs or the error stack trace, you find the following message.\nCaused by: java.lang.IllegalStateException: No Unity API token found in Unity Scope\nCause\nYou are using a non-UC compute to access volumes. Since the compute does not support Unity Catalog, it cannot retrieve the required Unity API token, causing the error.\nSolution\nTo access Unity Catalog volumes, you must use a Unity Catalog-enabled compute.\nFor more information, refer to the\nWhat are Unity Catalog volumes?\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-viewtable-does-not-exist-during-model-or-query-runtime-in-sql-analytics.json b/scraped_kb_articles/error-viewtable-does-not-exist-during-model-or-query-runtime-in-sql-analytics.json new file mode 100644 index 0000000000000000000000000000000000000000..17986569bc64d8cbb08a3f6219429ca5e366166f --- /dev/null +++ b/scraped_kb_articles/error-viewtable-does-not-exist-during-model-or-query-runtime-in-sql-analytics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/error-viewtable-does-not-exist-during-model-or-query-runtime-in-sql-analytics", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with SQL Analytics you encounter an error during runtime in your model or query indicating that a view or table does not exist, even after it has been created.\n[TABLE_OR_VIEW_NOT_FOUND] The table or view .. cannot be found.\nCause\nYou’re executing a job that uses\nCREATE OR REPLACE\non an underlying view while running another job concurrently that attempts to read the same view.\nReplacing a view is a delete-and-recreate operation, which means the view's old information, (including\ntable_id\nand credentials) is not retained. If a query attempts to access the view while it is being replaced, the credentials will no longer be valid, leading to the\nview/table does not exist\nerror.\nSolution\nDatabricks recommends using the\nALTER VIEW\ncommand to update the view definition instead of\nCREATE OR REPLACE VIEW\n.  This approach preserves the\ntable_id\nand ensures that the view remains accessible during updates.\nExample\nALTER VIEW .. AS\r\nSELECT * FROM ..;\nFor further guidance, refer to the\nALTER VIEW\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-attempting-to-install-torch-for-r-package.json b/scraped_kb_articles/error-when-attempting-to-install-torch-for-r-package.json new file mode 100644 index 0000000000000000000000000000000000000000..6c2324cb42148f45f39a77f9da85231816392084 --- /dev/null +++ b/scraped_kb_articles/error-when-attempting-to-install-torch-for-r-package.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/error-when-attempting-to-install-torch-for-r-package", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to install the package torch for R in compute using Databricks Runtime 16.1 ML, you encounter an error indicating certain dependencies are missing.\nERROR: dependencies ‘coro’, ‘safetensors’ are not available for package ‘torch’ * removing ‘/usr/local/lib/R/site-library/torch’\nCause\nNot all libraries required by a package are installed by default.\nSolution\nUse the following code to install the required dependencies before installing torch for R.\n%sh\r\n# Download the file using wget and save it to /tmp\r\nwget -O /tmp/torch_0.13.0.9001.tar.gz \"https://github.com/s-u/torch/releases/download/e93430/torch_0.13.0.9001.tar.gz\"\r\n\r\n%r\r\ninstall.packages(c(\"coro\", \"safetensors\"))\r\n\r\n%r\r\ninstall.packages(\"/tmp/torch_0.13.0.9001.tar.gz\", repos = NULL, type = \"source\")\r\n\r\n%r\r\nlibrary(torch)" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-attempting-to-use-stream_read_table-function-from-sparklyr-in-14-3-lts-ml.json b/scraped_kb_articles/error-when-attempting-to-use-stream_read_table-function-from-sparklyr-in-14-3-lts-ml.json new file mode 100644 index 0000000000000000000000000000000000000000..bb283bda5d93e5dc4b5b363ea1726df4b3314cc0 --- /dev/null +++ b/scraped_kb_articles/error-when-attempting-to-use-stream_read_table-function-from-sparklyr-in-14-3-lts-ml.json @@ -0,0 +1,5 @@ +{ + "url": 
"https://kb.databricks.com/en_US/libraries/error-when-attempting-to-use-stream_read_table-function-from-sparklyr-in-14-3-lts-ml", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to use the\nstream_read_table\nfunction from the Sparklyr package version 1.8.6 in Databricks Runtime 14.3 LTS ML, you receive an error message.\nError in stream_read_table(sc, 'table_name'): could not find function 'stream_read_table'.\nCause\nDatabricks Runtime 14.3 LTS ML includes Sparklyr version 1.8.1 by default, but the\nstream_read_table\nfunction requires version 1.8.6.\nSolution\nInstall Sparklyr version 1.8.6. Navigate to the cluster's library settings and add the Sparklyr package. This action will replace the default version, 1.8.1.\nRestart the cluster to implement the updated changes.\nRe-execute the\nstream_read_table\nfunction to verify the error is resolved.\nAlternatively, update Databricks Runtime to 16.0 or above, which uses Sparklyr 1.8.6 by default.\nImportant\nGenerally, changing package versions can potentially cause conflicts with other internal dependency changes. In this solution, Sparklyr 1.8.6 is a patch update, so no conflicts are anticipated. However as a precaution, you can test the version change in a development environment before deploying to production." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-creating-a-dataframe-with-values-of-different-data-types.json b/scraped_kb_articles/error-when-creating-a-dataframe-with-values-of-different-data-types.json new file mode 100644 index 0000000000000000000000000000000000000000..ad22303dbfe8cbb8e69f6ebea1b4a26b12d2739a --- /dev/null +++ b/scraped_kb_articles/error-when-creating-a-dataframe-with-values-of-different-data-types.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/error-when-creating-a-dataframe-with-values-of-different-data-types", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou create a DataFrame using a complex dictionary and display it, such as in the following example.\nresult = [\r\n    {\r\n        \"A\": {'AA': 'aa', 'BB': {'AAA': 'aaa'}}\r\n    }\r\n]\r\ndf = spark.createDataFrame(result)\r\ndf.display()\nUpon executing, you receive the following error.\n[CANNOT_INFER_TYPE_FOR_FIELD] Unable to infer the type of the field `A`.\nThe following screenshot shows the error in the UI.\nCause\nBy default, a dictionary’s values in its key-value pairs should have the same datatype.\nIn the example code, the first value\n‘aa’\nis a string, and the second value\n{'AAA': 'aaa'}\nis a further key-value pair. The difference causes the error.\nSolution\nTo enable your code logic to accept different datatypes for different values in the dictionary, set the following Apache Spark configuration in the same notebook in a preceding cell.\nspark.conf.set(\"spark.sql.pyspark.inferNestedDictAsStruct.enabled\", True)\nThe following screenshot shows a notebook setting this configuration in a cell before the DataFrame code, and then the original DataFrame code runs successfully. The schema allows\nAA\nto have a string value, while\nBB\nhas a key-value pair value." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-creating-a-delta-table-using-the-ui-and-external-data-in-delta-format.json b/scraped_kb_articles/error-when-creating-a-delta-table-using-the-ui-and-external-data-in-delta-format.json new file mode 100644 index 0000000000000000000000000000000000000000..3011ba87cf6210af78cd46f76b17c1b05a05032b --- /dev/null +++ b/scraped_kb_articles/error-when-creating-a-delta-table-using-the-ui-and-external-data-in-delta-format.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/error-when-creating-a-delta-table-using-the-ui-and-external-data-in-delta-format", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to use the UI to create a Delta table with an external data source in Delta format, you get the following error.\nCloudFilesIllegalArgumentException: Reading from a Delta table is not supported with this syntax.\nCause\nThe UI is designed to create Delta tables from external data in CSV, TSV, JSON, Avro, Parquet, or text file formats.\nSolution\nCreate the Delta table using a notebook command instead.\nOpen a notebook in your Databricks workspace.\nChoose one of the following methods based on the table type you want to create.\nExternal table type\nCREATE TABLE .. LOCATION ;\nManaged table type\ndf=spark.read.format(\"delta\").load(\"\")\r\ndf.write.mode(\"overwrite\").saveAsTable(\"\")\nFor more information, review the\nDelta table streaming reads and writes\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-creating-a-user-group-or-service-principal-at-the-account-level-with-terraform.json b/scraped_kb_articles/error-when-creating-a-user-group-or-service-principal-at-the-account-level-with-terraform.json new file mode 100644 index 0000000000000000000000000000000000000000..f085b435bff8aa5c1cef30e51775d05f4c3c484c --- /dev/null +++ b/scraped_kb_articles/error-when-creating-a-user-group-or-service-principal-at-the-account-level-with-terraform.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/terraform/error-when-creating-a-user-group-or-service-principal-at-the-account-level-with-terraform", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nUnity Catalog uses Databricks account identities to resolve users, service principals, and groups, and to enforce permissions. These identities can be managed using Terraform.\nYou are trying to create users, service principals, or groups at the account level when your Terraform code fails with a\nset `host` property\nerror message.\n2022-10-06T15:20:46.816+0300 [INFO]  Starting apply for databricks_group.bfr_databricks_groups\r\n2022-10-06T15:20:46.817+0300 [DEBUG] databricks_group.bfr_databricks_groups: applying the planned Create change\r\n2022-10-06T15:20:46.818+0300 [INFO]  provider.terraform-provider-databricks_v1.4.0: Using directly configured basic authentication: timestamp=2022-10-06T15:20:46.817+0300\r\n2022-10-06T15:20:46.818+0300 [INFO]  provider.terraform-provider-databricks_v1.4.0: Configured basic auth: host=\nhttps://accounts.cloud.databricks.com\n, username=, password=***REDACTED***: timestamp=2022-10-06T15:20:46.818+0300\r\n2022-10-06T15:20:46.818+0300 [DEBUG] provider.terraform-provider-databricks_v1.4.0: POST /api/2.0/preview/scim/v2/Groups {\r\n  \"displayName\": \"test\",\r\n  \"schemas\": [\r\n    \"urn:ietf:params:scim:schemas:core:2.0:Group\"\r\n  ]\r\n}: 
timestamp=2022-10-06T15:20:46.818+0300\r\n2022-10-06T15:20:48.283+0300 [DEBUG] provider.terraform-provider-databricks_v1.4.0: 405 Method Not Allowed [non-JSON document of 334 bytes]: timestamp=2022-10-06T15:20:48.283+0300\r\n2022-10-06T15:20:48.283+0300 [WARN]  provider.terraform-provider-databricks_v1.4.0: /api/2.0/preview/scim/v2/Groups:405 - Databricks API (/api/2.0/preview/scim/v2/Groups) requires you to set `host` property (or DATABRICKS_HOST env variable) to result of `databricks_mws_workspaces.this.workspace_url`. This error may happen if you're using provider in both normal and multiworkspace mode. Please refactor your code into different modules. Runnable example that we use for integration testing can be found in this repository at\nhttps://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/aws-workspace\n: timestamp=2022-10-06T15:20:48.283+0300\nCause\nThe\naccount_id\nparameter is missing in the Terraform Databricks provider block.\nSolution\nYou must add your Databricks\naccount_id\nto the Terraform Databricks provider block.\nInfo\nYou must be an admin user to get your Databricks account ID.\nLog in to the Databricks account console (\nAWS\n|\nAzure\n).\nClick\nUser Profile\n.\nLook for the\nAccount ID\nvalue in the pop-up.\n// initialize provider at account-level\r\nprovider \"databricks\" {\r\n alias = \"mws\"\r\n host = \"https://accounts.cloud.databricks.com\"\r\n account_id = \"\r\n username = var.databricks_account_username\r\n password = var.databricks_account_password\r\n}\nPlease review the Terraform\ndatabricks_group Resource\ndocumentation for more details." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-access-azure-storage-account-from-china-region.json b/scraped_kb_articles/error-when-trying-to-access-azure-storage-account-from-china-region.json new file mode 100644 index 0000000000000000000000000000000000000000..80d47e4317a56c3ad91af609816d10b0bc05fd4a --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-access-azure-storage-account-from-china-region.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/error-when-trying-to-access-azure-storage-account-from-china-region", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to access your Azure storage account, or external tables associated with that account, from a China region-based Azure Databricks environment, you receive an error.\n`Failure to initialize configuration for storage account xxxxxxxxxxxxxx.dfs.core.chinacloudapi.cn: Invalid configuration value detected for fs.azure.account.key`\nCause\nThere is an issue with your Apache Spark properties used to configure Azure credentials to access the Azure storage account.\nAlternatively, there is an issue with the Spark configuration setting for the OAuth endpoint.\nSolution\nFirst, make sure your Spark properties do not have any of the following.\nTypos\nIncorrect or invalid Spark configurations\nExpired secrets\nMissing required permissions for the service principal on the storage account\nMissing configured external location credentials to access the storage account\nThen check your Spark configuration setting for the OAuth endpoint. 
Change the endpoint setting to the correct value for the China region.\nspark.hadoop.fs.azure.account.oauth2.client.endpoint..dfs.core.chinacloudapi.cn https://login.chinacloudapi.cn//oauth2/v2.0/token\nPreventative measures\nEnsure that the Spark configuration settings are correct for the specific Azure region and environment.\nAll the national clouds authenticate users separately in each environment and have separate authentication endpoints. For more information, refer to Microsoft’s\nNational clouds\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-copy-datasets-to-different-regions-using-delta-sharing.json b/scraped_kb_articles/error-when-trying-to-copy-datasets-to-different-regions-using-delta-sharing.json new file mode 100644 index 0000000000000000000000000000000000000000..c4a42db473671093fec06ed2e7ca72bcf4d04334 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-copy-datasets-to-different-regions-using-delta-sharing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-when-trying-to-copy-datasets-to-different-regions-using-delta-sharing", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to copy data between different regions using Delta Sharing, you encounter the following errors.\nError: Failed to parse the schema. Encountered unsupported Delta data type: VOID\nError: Interval day to second is not a supported delta data type\nCause\nDelta Sharing fails at the table level because Delta Lake does not support the\n\"VOID\"\nand\n\"interval day to second\"\ndata types.\nSolution\nYou have three choices depending on your use case.\nFirst, instead of using a single\n\"interval day to second\"\ncolumn, split the interval into separate columns for days, hours, minutes, and seconds. Each of these components can be stored as individual integer columns. 
You can still represent the interval, but in a Delta Sharing-supported format.\nAlternatively, convert the interval to a string. For example, you can store the interval as a string in the format\n\"D days HH:MM:SS\"\n. This allows you to keep the interval information in a single column, although it will be stored as a string.\nLast, convert the entire interval into a total number of seconds and store it as an integer. This approach simplifies the storage and can be useful if you only need to perform arithmetic operations on the interval." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-create-a-distributed-ray-dataset-using-from_spark-function.json b/scraped_kb_articles/error-when-trying-to-create-a-distributed-ray-dataset-using-from_spark-function.json new file mode 100644 index 0000000000000000000000000000000000000000..55ce2e77d19b22db0586199b2e7b90fe45363c10 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-create-a-distributed-ray-dataset-using-from_spark-function.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/error-when-trying-to-create-a-distributed-ray-dataset-using-from_spark-function", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to create a distributed Ray dataset from an Apache Spark DataFrame using the\nray.data.from_spark()\nfunction, you encounter the following error.\nRuntimeError: In databricks runtime, if you want to use 'ray.data.from_spark' API, you need to set spark cluster config 'spark.databricks.pyspark.dataFrameChunk.enabled' to 'true'.\r\nFile , line 3\r\n      1 import ray.data\r\n----> 3 ray_dataset = ray.data.from_spark(dataframe)\nCause\nThe\nspark.databricks.pyspark.dataFrameChunk.enabled\nconfiguration is set to\nfalse\nby default.\nSolution\nSet\nspark.databricks.pyspark.dataFrameChunk.enabled\nto\ntrue\nto ensure the\nfrom_spark()\nfunction works as expected.\nNavigate to your cluster’s configuration page.\nClick the\nAdvanced 
Options\naccordion.\nClick the\nSpark\ntab.\nIn the\nSpark Config\ntextbox, enter\nspark.databricks.pyspark.dataFrameChunk.enabled true\nClick\nConfirm\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-create-more-new-jobs-than-the-limit-quota.json b/scraped_kb_articles/error-when-trying-to-create-more-new-jobs-than-the-limit-quota.json new file mode 100644 index 0000000000000000000000000000000000000000..a7cbd05eb12fa66ad9b7222420853c2428701127 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-create-more-new-jobs-than-the-limit-quota.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/error-when-trying-to-create-more-new-jobs-than-the-limit-quota", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIf you try to create more new jobs than the limit quota, you receive an error message.\n\"Exception: Error while running create_job_api: {'error_code': 'QUOTA_EXCEEDED', 'message': 'The quota for the number of jobs has been reached\"\nCause\nDatabricks sets a limit for the number of jobs that can be created through the UI or the Jobs API. Each cloud provider sets a resource limit (\nAWS\n|\nAzure\n|\nGCP\n).\nLimits are enforced to prevent excessive job creation that could impact system performance and resource allocation.\nSolution\nConfirm the amount of jobs you have in your workspace, then identify and delete the jobs you do not need. Follow your preferred identification and deletion process (whether via Databricks API or through the UI).\nNote\nIf you have a one-time job, use the\n/jobs/runs/submit\nendpoint instead of creating new jobs through the\n/jobs/create\nendpoint. This endpoint does not have a quota limit.\nExample code to confirm number of jobs\nThe following code uses\nDatabricks SDK for Python\n(\nAWS\n|\nAzure\n|\nGCP\n).  
It will output the total number of saved jobs in a notebook cell.\nfrom databricks.sdk import WorkspaceClient\r\nfrom pyspark.sql import SparkSession\r\nfrom pyspark.sql.types import StructType, StructField, LongType, StringType\r\n\r\n# Initialize Spark session\r\nspark = SparkSession.builder.appName(\"DatabricksJobList\").getOrCreate()\r\n\r\nw = WorkspaceClient()  # this client is for the current workspace, no PAT needed\r\n\r\n# Get the list of jobs\r\njob_list = w.jobs.list()\r\n\r\n# Create a list to store job details\r\njobs_list = []\r\n\r\n# Extract required fields from each job and append to the list\r\nfor job in job_list:\r\n    jobs_list.append({\r\n        \"created_time\": job.created_time,\r\n        \"creator_user_name\": job.creator_user_name,\r\n        \"name\": job.settings.name\r\n    })\r\n\r\n# Define the schema for the DataFrame\r\nschema = StructType([\r\n    StructField(\"created_time\", LongType(), True),\r\n    StructField(\"creator_user_name\", StringType(), True),\r\n    StructField(\"name\", StringType(), True)\r\n])\r\n\r\n# Create a DataFrame from the list of job details\r\ndf = spark.createDataFrame(jobs_list, schema)  # DataFrame of jobs\r\n\r\n# Uncomment the following line to see the jobs list as a DataFrame\r\n# df.show(truncate=False)\r\n\r\ntotal_jobs = df.count()\r\n\r\nprint(f'Total number of jobs in the workspace is {total_jobs}')\nIn general, Databricks recommends regularly monitoring and cleaning up outdated or unnecessary jobs to ensure that the workspace does not reach the quota limit." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-load-a-dataset-after-integrating-unity-catalog-metadata-with-power-bi.json b/scraped_kb_articles/error-when-trying-to-load-a-dataset-after-integrating-unity-catalog-metadata-with-power-bi.json new file mode 100644 index 0000000000000000000000000000000000000000..7cdfb7e10c4f60125b1f379c03bd4aed6d4b7015 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-load-a-dataset-after-integrating-unity-catalog-metadata-with-power-bi.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/error-when-trying-to-load-a-dataset-after-integrating-unity-catalog-metadata-with-power-bi", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen integrating Unity Catalog metadata with Power BI and attempting to load a dataset, you encounter the following error message.\nOne or more errors occurred. ODBC: ERROR [HY000] [Microsoft][DSI] (20039) Cannot store \"\".\"\".\"\".\"REMARKS\" value in temporary table without truncation. (Column metadata implied a maximum of 512 bytes, while provided value is xxx bytes).\nCause\nThe comment metadata in the dataset is too large to be handled by Power BI’s ODBC driver. By default, Power BI expects metadata to not exceed 512 bytes.\nSolution\nFirst, update the ODBC connection string to increase the byte limit in the\n.ini\nfile. Add the parameter\nMaxCommentLen=2048\nto the ODBC string. This change increases the comment length limit to 2048 bytes, allowing Power BI to store longer metadata without truncation.\nNote\nIt is good practice to keep column comments within 512 characters when possible.\nThen, use Power BI’s native query option in the UI to query the table directly. This bypasses some of the limitations in Power BI's regular data connectivity mode, ensuring that metadata issues are handled more gracefully. The following shows an example query in the native query field of Power BI’s UI." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-parse-xml-in-a-shared-mode-cluster-using-the-from_xmlfunction.json b/scraped_kb_articles/error-when-trying-to-parse-xml-in-a-shared-mode-cluster-using-the-from_xmlfunction.json new file mode 100644 index 0000000000000000000000000000000000000000..6e8e048068e04851b0859dd4732bd2146077f725 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-parse-xml-in-a-shared-mode-cluster-using-the-from_xmlfunction.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/error-when-trying-to-parse-xml-in-a-shared-mode-cluster-using-the-from_xmlfunction", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to parse XML by passing in a StructType using the\nfrom_xml()\nfunction in a shared mode cluster, you receive an error.\nError message example\n[PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: XXXXX\r\nFile , line 11\r\n      1 bookschema = StructType([\r\n      2 StructField(\"_id\", StringType(), True),\r\n      3 StructField(\"author\", StringType(), True),\r\n      4 StructField(\"title\", StringType(), True),\r\n      5 ])\r\n      7 parsed_df = (\r\n      8 raw_df\r\n      9 .withColumn(\"parsedxml\", from_xml(raw_df.bookxmlstr, bookschema))\r\n     10 )\r\n---> 11 parsed_df.display()\nCause\nStructType is not supported in shared cluster mode.\nSolution\nDefine your desired XML schema in a Data Definition Language (DDL) string instead of a StructType, and then pass it to the\nfrom_xml()\nfunction. You can use the following example. Make sure the fields, tags, and attribute names in the DDL align with your XML file.\nExample\nThe following code snippet defines rows with IDs and an XML string, then creates a list of tuples where each tuple is the row ID and XML string. It then creates a PySpark DataFrame from the list of tuples and defines a schema as a DDL string. 
Last, it parses the XML column and displays the parsed DataFrame.\nfrom pyspark.sql.functions import from_xml\r\n\r\n# SAMPLE XML STRINGS (replace with your own XML content; the tag and column names below are illustrative)\r\n\r\nxml_data_1 = \"\"\"<book id=\"bk101\">\r\n    <author>Some Value</author>\r\n    <title>Another Value</title>\r\n</book>\"\"\"\r\n\r\nxml_data_2 = \"\"\"<book id=\"bk102\">\r\n    <author>Different Value</author>\r\n    <title>More Data</title>\r\n</book>\"\"\"\r\n\r\n# Each tuple is (row ID, XML string)\r\n\r\ndata_list = [\r\n    (\"1\", xml_data_1),\r\n    (\"2\", xml_data_2),\r\n]\r\n\r\n# Create a PySpark DataFrame from the list of tuples\r\n\r\nraw_df = spark.createDataFrame(data_list, [\"row_id\", \"bookxmlstr\"])\r\n\r\n# Define schema as a DDL string\r\n# This describes the expected fields in the XML and their types\r\n# (XML attributes such as id are surfaced with a leading underscore)\r\n\r\nddl_schema_string = \"_id STRING, author STRING, title STRING\"\r\n\r\n# Parse the XML column using 'from_xml'\r\nparsed_df = (\r\n    raw_df\r\n    .withColumn(\"parsedxml\", from_xml(raw_df[\"bookxmlstr\"], ddl_schema_string))\r\n)\r\n\r\n# Display the parsed DataFrame\r\nparsed_df.display()" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-perform-maintenance-on-a-delta-table.json b/scraped_kb_articles/error-when-trying-to-perform-maintenance-on-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..74c6f22dd893cec05df72c82920b1297b3709501 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-perform-maintenance-on-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/error-when-trying-to-perform-maintenance-on-a-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you attempt to perform maintenance operations on a Delta table, you encounter an error.\nio.delta.exceptions.ConcurrentTransactionException: A conflicting metadata domain com.databricks.liquid is added.\nThe error persists even though you are the only person running commands on the table.\nCause\nYou are running concurrent maintenance operations such 
as\nOPTIMIZE\n,\nANALYZE\n, or\nVACUUM\non the Delta table.\nIf you are the sole person performing these operations, it is likely due to the predictive optimization feature being enabled for your tables. Predictive optimization executes maintenance commands using a background cluster that isn’t visible in your UI.\nConfirm whether you have concurrent maintenance operations running by using the\nDESC HISTORY\ncommand on the Delta table. Check if another maintenance command was executed at the time of failure. If predictive optimization was responsible, the user and cluster associated with the command will differ from your username and cluster. The username will typically appear in a format such as\n11111111-aaaa-bbbb-cccc-dddddddddddd\n.\nSolution\nAvoid triggering manual maintenance commands.\nIf you need to manage operations manually, use the following code to disable the predictive optimization feature for specific schemas or catalogs.\nALTER SCHEMA your_schema_name DISABLE PREDICTIVE OPTIMIZATION;\r\nALTER CATALOG your_catalog_name DISABLE PREDICTIVE OPTIMIZATION;\nFor more information, review the\nPredictive optimization for Unity Catalog managed tables\n(\nAWS\n|\nAzure\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-read-a-notebook-from-another-notebook-in-the-same-workspace.json b/scraped_kb_articles/error-when-trying-to-read-a-notebook-from-another-notebook-in-the-same-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..f508fb0a9cb179e81523396183d1c854fcc09743 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-read-a-notebook-from-another-notebook-in-the-same-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/error-when-trying-to-read-a-notebook-from-another-notebook-in-the-same-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to read the contents of one notebook from another notebook within the same Databricks workspace, you receive an error message.\nAttempt to read file at path: /Workspace/Users//Development/your/filepath\r\nResult: [Errno 2] No such file or directory: '/Workspace/Users//Development/your/filepath'\nThis issue arises when using the following Python function.\nwith open(, 'r') as file:\nCause\nThe workspace file system does not support direct file operations using standard Python I/O functions like\nopen()\n.\nSolution\nUse the Databricks SDK to read the contents of a notebook.\nFirst, install the Databricks SDK.\npip install databricks-sdk\nThen, use the following code to read the notebook contents. 
This code uses the Databricks SDK to export the notebook content and decode it from base64 format.\nfrom databricks.sdk import WorkspaceClient\r\nfrom databricks.sdk.service import workspace\r\nimport base64\r\n\r\ndef notebook_reader(file_location):\r\n    databricksURL = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None)\r\n    myToken = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)\r\n\r\n    w = WorkspaceClient(host=databricksURL, token=myToken)\r\n    export_response = w.workspace.export(file_location, format=workspace.ExportFormat.SOURCE)\r\n    notebook_content = base64.b64decode(export_response.content).decode('utf-8')\r\n    return notebook_content\r\n\r\nfile_location = \"\" # set this to the workspace path of the notebook to read\r\nnotebook_content = notebook_reader(file_location)\r\nprint(notebook_content)" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-read-data-from-mongodb.json b/scraped_kb_articles/error-when-trying-to-read-data-from-mongodb.json new file mode 100644 index 0000000000000000000000000000000000000000..031997428ed35b0af1d13bfe40bac6500ae2107b --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-read-data-from-mongodb.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/error-when-trying-to-read-data-from-mongodb", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to read data from MongoDB in an all-purpose compute with standard (formerly shared) access mode, you receive an error message.\n`org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: User does not have permission SELECT on any file. 
SQLSTATE: 42501`.\nThis issue is specific to the MongoDB connector and does not occur when using a compute with dedicated access mode and the same Databricks Runtime version and package.\nCause\nStandard access mode compute and the MongoDB connector are not governed by Unity Catalog (UC).\nWhen you use Unity Catalog-enabled standard access mode compute or SQL warehouses to access storage paths or data sources not governed by UC, Databricks evaluates your privileges on the\n`ANY FILE`\nsecurable.\nYou do not have privileges on the\n`ANY FILE`\nsecurable, resulting in the error message.\nSolution\nUse Lakehouse Federation to access external data sources, which does not require privileges on the\n`ANY FILE`\nsecurable. For more information, review the\nWhat is Lakehouse Federation?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf Lakehouse Federation is not an option, use a dedicated access mode compute instead of standard access mode.\nTo prevent similar issues in the future, ensure that the necessary privileges are granted when using custom data sources or JDBC drivers not included in Lakehouse Federation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-use-apache-sparks-pyspark-offset-method-on-dataframes-with-serverless-compute.json b/scraped_kb_articles/error-when-trying-to-use-apache-sparks-pyspark-offset-method-on-dataframes-with-serverless-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..60b7957f24193e5d1c296e6b7d50b118a25ce4c3 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-use-apache-sparks-pyspark-offset-method-on-dataframes-with-serverless-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/error-when-trying-to-use-apache-sparks-pyspark-offset-method-on-dataframes-with-serverless-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to use Apache Spark’s Pyspark\noffset\nmethod on DataFrames using serverless compute, you receive an error message.\nUnsupportedOperationException: Do not support SQL OFFSET with HybridCloudStoreFormat\nExample\nThis code creates a DataFrame from a list of the provided data, attempts to use the offset method to skip the first 2 rows, and then displays the resulting DataFrame.\nWhen run on a classic compute cluster, the example code runs without issue. 
When run on serverless compute, the code results in an error when the offset method is called.\n%python\r\n\r\nfrom pyspark.sql import SparkSession\r\n\r\n# create a dataframe\r\ndata = [(\"1\", \"a\"), (\"2\", \"b\"), (\"3\", \"c\"), (\"4\", \"d\"), (\"5\", \"e\")]\r\ndf = spark.createDataFrame(data, [\"id\", \"val\"])\r\n\r\n# use the offset method\r\ndf_offset = df.orderBy(\"id\").offset(2)\r\ndf_offset.show()\nCause\nThe\noffset\nmethod is not supported in serverless compute due to the storage format used and underlying architecture.\nThe optimized storage format,\nHybridCloudStoreFormat\nis designed for distributed computing.\nHybridCloudStoreFormat\nrelies on the cluster's ability to maintain a stateful connection to the data source, which is not possible in serverless compute due to its stateless task execution.\nBecause serverless compute executes tasks in a stateless manner, the connection to the data source is established and closed for each task.\nSolution\nYou should choose an alternative function to achieve similar results.\nFor small datasets\nUse the\nlimit\nmethod to achieve similar results to the\noffset\nmethod.\nInfo\nAlthough the\nlimit\nmethod is efficient and easy to use, it only returns a limited number of rows from the beginning of the DataFrame and is less flexible than\noffset\n.\nExample using limit\n%python\r\n\r\ndf1 = spark.createDataFrame(data, [\"id\", \"val\"])\r\ndfNew = df1.orderBy(\"id\").limit(1000).filter(df1.id > 5)\r\ndfNew.show()\nFor large datasets with pagination\nUse the\nmonotonically_increasing_id()\nlibrary function. 
Creating a unique ID for each row allows you to filter based on that ID for efficient pagination, and ensures consistent results in the presence of high write throughput.\nExample using\nmonotonically_increasing_id()\n%python\r\n\r\nfrom pyspark.sql.functions import monotonically_increasing_id, col\r\ndf = spark.createDataFrame(data, [\"id\", \"val\"])\r\n\r\n# Add a unique identifier to each row\r\ndf_with_id = df.withColumn(\"row_id\", monotonically_increasing_id())\r\n\r\n# Simulate offset by filtering based on the row_id\r\noffset_value = 2\r\ndf_offset = df_with_id.filter(col(\"row_id\") >= offset_value).drop(\"row_id\")\r\ndf_offset.show()\nInfo\nWhen choosing between classic compute and serverless compute for your jobs, you should select the compute that is best suited for your specific task. For more information, review the\nServerless compute limitations\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-use-rdd-code-in-shared-clusters.json b/scraped_kb_articles/error-when-trying-to-use-rdd-code-in-shared-clusters.json new file mode 100644 index 0000000000000000000000000000000000000000..dc86c14c4f876da97fc76bb9b0f629b8832d9f97 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-use-rdd-code-in-shared-clusters.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/error-when-trying-to-use-rdd-code-in-shared-clusters", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to use Resilient Distributed Dataset (RDD) code in a shared cluster, you receive an error.\nError: Method public org.apache.spark.rdd.RDD\r\norg.apache.spark.api.java.JavaRDD.rdd() is not allowlisted on class class org.apache.spark.api.java.JavaRDD\nCause\nDatabricks Runtime versions with Unity Catalog enabled do not support RDDs on shared clusters.\nSolution\nUse a single-user cluster instead, which supports RDD functionality.\nIf 
you want to continue using a shared cluster, use the DataFrame API instead of the RDD API. For example, you can use\nspark.createDataFrame\nto create DataFrames.\nFor more information on creating DataFrames, refer to the Apache Spark\npyspark.sql.SparkSession.createDataFrame\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-when-trying-to-use-vectorassembler-while-on-a-standard-access-mode-cluster-with-a-non-ml-databricks-runtime.json b/scraped_kb_articles/error-when-trying-to-use-vectorassembler-while-on-a-standard-access-mode-cluster-with-a-non-ml-databricks-runtime.json new file mode 100644 index 0000000000000000000000000000000000000000..2cd2e14b4a2447328500d6627c56609b9c82f0a2 --- /dev/null +++ b/scraped_kb_articles/error-when-trying-to-use-vectorassembler-while-on-a-standard-access-mode-cluster-with-a-non-ml-databricks-runtime.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/error-when-trying-to-use-vectorassembler-while-on-a-standard-access-mode-cluster-with-a-non-ml-databricks-runtime", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWorking in a notebook attached to a standard (formerly shared) access mode cluster with a non-ML Databricks Runtime, you’re attempting to use VectorAssembler to combine multiple feature columns into a single feature vector. You encounter the following error.\nPy4JError: An error occurred while calling None.org.apache.spark.ml.feature.VectorAssembler. Trace:\r\npy4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not allowlisted.\nCause\nApache Spark Machine Learning Library (MLlib) is not supported in UC-enabled clusters with standard access mode.\nSolution\nUse a Dedicated (formerly single user) access mode cluster, assigned to a group of users. 
This feature is currently in public preview, so it needs to be enabled at the workspace level by a workspace admin.\nClick the user icon at the top right of your workspace page, then click\nPreviews\n.\nClick the\nCompute: Dedicated group clusters\npreview toggle to turn on.\nAfter that, you are able to create a dedicated cluster for a group in your workspace. Follow the steps in the\nAssign compute resources to a group\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-while-calling-o430-mount-and-the-executor-has-failed-to-refresh-mounts.json b/scraped_kb_articles/error-while-calling-o430-mount-and-the-executor-has-failed-to-refresh-mounts.json new file mode 100644 index 0000000000000000000000000000000000000000..af2ba92fddaa6be7e57823d6925a679e6497cd82 --- /dev/null +++ b/scraped_kb_articles/error-while-calling-o430-mount-and-the-executor-has-failed-to-refresh-mounts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/error-while-calling-o430-mount-and-the-executor-has-failed-to-refresh-mounts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to run\ndbutils.fs.unmount(mounting_point)\nover a notebook using an all-purpose cluster, you receive an error message.\nAnswer: ExecutionError: An error occurred while calling o430.mount.\r\n: java.lang.Exception: Executor 58 has failed to refresh mounts. Reads and writes can be inconsistent. Call refreshMounts(). 
If the problem persists after a few minutes, try restarting the cluster, or contact Databricks support.\nCause\nMounting and unmounting operations are not atomic or transactional, which can lead to executor inconsistencies while refreshing the mount status.\nSolution\nUse the command\ndbutils.fs.refreshMounts()\nbefore a mount operation to ensure that all executors are updated.\nFollow the error message, call\nrefreshMounts\n, and do not call mount/unmount operations again if the error occurs.\nImplement the following code logic to handle the failure.\nfrom time import sleep\r\ntry:\r\n   dbutils.fs.unmount(mount_path)\r\nexcept Exception as e:\r\n   if \"failed to refresh mounts\" in str(e).lower():\r\n       dbutils.fs.refreshMounts()\r\n       print('Refreshing mounts... waiting for 3 min...')\r\n       sleep(180)\r\n   else:\r\n       # If the error is different, raise it or handle it differently\r\n       raise e" +} \ No newline at end of file diff --git a/scraped_kb_articles/error-while-establishing-user-sessions-to-oracle-database-through-an-external-hive-metastore.json b/scraped_kb_articles/error-while-establishing-user-sessions-to-oracle-database-through-an-external-hive-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..998433870038da6061c88aa0cc359b211a668b2a --- /dev/null +++ b/scraped_kb_articles/error-while-establishing-user-sessions-to-oracle-database-through-an-external-hive-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/error-while-establishing-user-sessions-to-oracle-database-through-an-external-hive-metastore", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen establishing a connection to an Oracle database through an external Hive metastore, you receive an error message,\n“exceeded simultaneous SESSIONS_PER_USER limit”\n.\nYou may notice a high number of Databricks compute resources establishing sessions to the external Hive metastore, persisting 
connections even when the cluster scales down to a minimal configuration, and/or connections remaining active until a cluster is terminated.\nCause\nYou have more simultaneous sessions per user than the Oracle database allows.\nSolution\nIncrease the\nSESSIONS_PER_USER\nlimit on your Oracle database.\nOracle recommends the same action\n. To increase the session limit per user, execute the following query.\nALTER PROFILE your_profile_name LIMIT SESSIONS_PER_USER new_limit;\nIf increasing the session limit on your Oracle database directly is not possible, use an Apache Spark cluster configuration to set your pool size.\nspark.databricks.hive.metastore.client.pool.size {max_number_sessions}\nImportant\nLimiting the connection pool size may impact performance, especially in multithreaded situations.\nNote\nWhen possible, Databricks recommends using Unity Catalog as the primary metadata store for your workloads. This can help reduce the dependency on the external Hive metastore and minimize the number of connections required. For more information, review the\nWhat is Unity Catalog?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAdditional preventative measures\nIf you are using a Databricks Runtime version below 10.4 LTS and have not set the Spark setting\nspark.databricks.hive.metastore.client.pool.type\n, you are likely using BoneCP. Consider using HikariCP for connection pooling instead. To set up HikariCP, use the Spark setting\nspark.databricks.hive.metastore.client.pool.type HikariCP\n. HikariCP is the default option as of Databricks Runtime 10.4 LTS. For more information, review the\nDatabricks Runtime 10.4 LTS\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nRegularly monitor the number of connections to the external Hive metastore and terminate idle connections. This can help reduce resource contention and improve the overall performance of the system.\nImplement a cluster management strategy that terminates clusters when they are no longer in use. 
This can help reduce the number of active connections to the external Hive metastore and free up resources for other workloads." +} \ No newline at end of file diff --git a/scraped_kb_articles/error-while-using-the-unstructured-library.json b/scraped_kb_articles/error-while-using-the-unstructured-library.json new file mode 100644 index 0000000000000000000000000000000000000000..7c83269bee728ef2028ffb295a6563884b0e0377 --- /dev/null +++ b/scraped_kb_articles/error-while-using-the-unstructured-library.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/error-while-using-the-unstructured-library", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the Unstructured library in your workspace to extract content from PDF files, you encounter the following error.\nPDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?\nCause\nYou are missing the system-level dependency, Poppler.\nThe Unstructured library internally uses the\npdf2image\nPython package to process PDF files.\nPdf2image\nrelies on a command-line tool called\npdfinfo\n, which is part of the Poppler utility suite, to extract metadata like page count and layout.\nHowever, Poppler is not a Python package and cannot be installed using\n%pip install\n. 
The required tool,\npdfinfo\n, comes from the\npoppler-utils\nLinux package, which must be installed using a system package manager like\napt-get\n.\nIf\npoppler-utils\nis not installed or not available in the system PATH,\npdf2image\nwill raise a\nPDFInfoNotInstalledError\n.\nSolution\nInstall pdf2image using %pip or the Libraries UI\nFirst ensure the Python library pdf2image is correctly installed in your notebook or cluster environment.\nUsing a notebook\n%pip install pdf2image\nUsing the Libraries UI\nGo to\nCompute > Your Cluster > Libraries > Install New\nSelect\nPyPI\n, and in the package field, enter “pdf2image”\nCreate an init script to install poppler-utils\nInstall system-level dependencies like Poppler using init scripts, which are executed automatically on cluster startup.\n1. Use the workspace file browser to\ncreate a new file\n(\nAWS\n|\nAzure\n|\nGCP\n) in your home directory. Call it\ninstall_poppler.sh\n.\n2. Copy the following sample script and paste it into the\ninstall_poppler.sh\nfile you just created:\n#!/bin/bash\r\n\r\nsudo apt-get update && sudo apt-get install -y poppler-utils\r\n\r\n# (Optional) Install OCR engine used in some PDF workflows\r\nsudo apt-get install -y tesseract-ocr\n3. Your init script is located at\n/Workspace/Users//install_poppler.sh\n.  Remember the path to the init script. You will need it when configuring your cluster.\nConfigure the init script on your cluster\nGo to\nCompute > Your Cluster > Advanced Options > Init Scripts\nEnter the file path\n/Workspace/Users//install_poppler.sh\nClick\nAdd\n.\nClick\nConfirm\nand then restart the cluster to apply the script.\nFor more information, refer to the\nCluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/etl-jobs-failing-in-databricks-sql-when-casting-oracle-number-columns-to-delta-tables.json b/scraped_kb_articles/etl-jobs-failing-in-databricks-sql-when-casting-oracle-number-columns-to-delta-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..1911ab0b16d92aa1ca126fae2b526105f97c3945 --- /dev/null +++ b/scraped_kb_articles/etl-jobs-failing-in-databricks-sql-when-casting-oracle-number-columns-to-delta-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/etl-jobs-failing-in-databricks-sql-when-casting-oracle-number-columns-to-delta-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile working in a Databricks SQL environment, your ETL jobs fail when Oracle\nNUMBER\ncolumns with precision up to 40 digits are cast to\nDECIMAL(38,8)\nin Delta tables.\nCause\nThe numeric value exceeds the representable limit for the specified precision and scale.\nDatabricks SQL strictly enforces the precision and scale limits of the\nDECIMAL\ntype. The maximum value that can be stored in\nDECIMAL(38,8)\nmust not exceed 999999999999999999999999999999.99999999 (30 digits before the decimal, and eight after the decimal).\nWhen the system encounters values with more than 38 digits of precision, it throws an error due to overflow. Overflow is common when ingesting Oracle\nNUMBER\ntypes with high precision.\nSolution\nTo handle such large values without precision loss or overflow, store the affected columns as type\nSTRING\nin the Delta table.\nThen, register a UDF using Python's decimal module to perform arithmetic with extended precision. This approach allows accurate computations on values exceeding\nDECIMAL(38,8)\nlimits. 
You can modify the following example code for registering the UDF.\n```python\r\nfrom decimal import Decimal, getcontext\r\nfrom pyspark.sql.types import StringType\r\n\r\n# Register the UDF for high-precision arithmetic\r\ndef udf_calculate_high_precision(col_a, col_b, col_c, col_d):\r\n    getcontext().prec = 50\r\n    a = Decimal(col_a)\r\n    b = Decimal(col_b)\r\n    c = Decimal(col_c)\r\n    d = Decimal(col_d)\r\n    result = (a - b) + (c - d)\r\n    return str(result)\r\n\r\nspark.udf.register(\"udf_calculate_high_precision\", udf_calculate_high_precision, StringType())\r\n```\nThe following code subsequently uses the registered UDF in SQL.\n```sql\r\nSELECT\r\n  udf_calculate_high_precision(\r\n    t.test_col_a,\r\n    t.test_col_b,\r\n    t.test_col_c,\r\n    t.test_col_d\r\n  ) AS calculated_result\r\nFROM test_db.test_table t\r\nWHERE t.test_filter_col = 'test_value';\r\n```\nAs an additional consideration, be sure to clearly document column transformations when shifting from Oracle to Delta Lake." +} \ No newline at end of file diff --git a/scraped_kb_articles/etl-process-fails-to-process-a-column-and-throws-error-row-group-size-has-overflowed.json b/scraped_kb_articles/etl-process-fails-to-process-a-column-and-throws-error-row-group-size-has-overflowed.json new file mode 100644 index 0000000000000000000000000000000000000000..f5460339768e8161abb3298e0ce4d00bd3d0feea --- /dev/null +++ b/scraped_kb_articles/etl-process-fails-to-process-a-column-and-throws-error-row-group-size-has-overflowed.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/etl-process-fails-to-process-a-column-and-throws-error-row-group-size-has-overflowed", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAfter upgrading to Databricks Runtime 14.3 LTS, you try to process a column containing a large amount of unparsed data. 
The extract, transform, load (ETL) process that handles large data batches fails with the error\nRow group size has overflowed\n.\nCause\nThe row group size exceeds the maximum allowable limit for Parquet files. Databricks Runtime 14.3 LTS and above have stricter or more frequent checks for row group size, causing this error to surface.\nSolution\nNavigate to your cluster.\nClick\nAdvanced options\n.\nIn the\nSpark config\nbox under the\nSpark\ntab, add the following configuration settings to reduce the default row group size and increase the size check frequency.\nspark.hadoop.parquet.page.size.row.check.max  5\nspark.hadoop.parquet.block.size.row.check.max  5\nspark.hadoop.parquet.page.size.row.check.min  5\nspark.hadoop.parquet.block.size.row.check.min  5\nThese settings lower the thresholds for size checks from\n10\n, the default, to\n5\n, ensuring that large rows are detected and handled before causing an overflow. If you still face the issue, please lower the settings further." +} \ No newline at end of file diff --git a/scraped_kb_articles/execution-error-when-trying-to-mount-a-storage-account.json b/scraped_kb_articles/execution-error-when-trying-to-mount-a-storage-account.json new file mode 100644 index 0000000000000000000000000000000000000000..f30a31321660eb47a0e9bc3bf851daaf9802d60a --- /dev/null +++ b/scraped_kb_articles/execution-error-when-trying-to-mount-a-storage-account.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/execution-error-when-trying-to-mount-a-storage-account", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to mount a storage account in Databricks, you receive the following error.\nExecutionError: An error occurred while calling .\r\n: com.databricks.backend.daemon.data.common.InvalidMountException: Error while using path for creating file system within mount at ''.\r\nCaused by: com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS 
Gen2 Token\nCause\nYou’re using nested mounts, where one mount is created within another. Nested mounts are not supported.\nExample of nested mounts\ndbutils.fs.mount(\r\n    source=\"abfss://container1@storageaccount.dfs.core.windows.net\",\r\n    mount_point=\"/mnt/storage1\"\r\n)\r\n\r\n# Now, trying to mount another storage inside /mnt/storage1, which is already in use:\r\ndbutils.fs.mount(\r\n    source=\"abfss://container2@storageaccount.dfs.core.windows.net\",\r\n    mount_point=\"/mnt/storage1/storage2\"\r\n)\nSolution\nFirst, unmount the incorrect nested mount.\ndbutils.fs.unmount(\"/mnt/storage1/storage2\")\nThen, remount each mount independently, at separate locations.\ndbutils.fs.mount(\r\n    source=\"\",\r\n    mount_point=\"/mnt/storage1\"\r\n)\r\ndbutils.fs.mount(\r\n    source=\"\",\r\n    mount_point=\"/mnt/storage2\"\r\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/executionexception-error-when-trying-to-use-conda-as-an-environment-manager-in-mlflow.json b/scraped_kb_articles/executionexception-error-when-trying-to-use-conda-as-an-environment-manager-in-mlflow.json new file mode 100644 index 0000000000000000000000000000000000000000..d76eb6e226b9cf6fbf0434dc45fb87d16926523e --- /dev/null +++ b/scraped_kb_articles/executionexception-error-when-trying-to-use-conda-as-an-environment-manager-in-mlflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/executionexception-error-when-trying-to-use-conda-as-an-environment-manager-in-mlflow", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you use Conda as your environment manager in MLflow on Databricks Runtime 13.x ML and above, you encounter the following\nExecutionException\nerror. This problem does not occur in Databricks Runtime 12.2 ML LTS.\n`ExecutionException: Could not find Conda executable at /databricks/conda/bin/conda. Ensure Conda is installed as per the instructions at . 
You can also configure MLflow to look for a specific Conda executable by setting the MLFLOW_CONDA_HOME environment variable to the path of the Conda executable.`\nCause\nMLflow is still configured to use Conda as your environment manager to manage dependencies, but it is no longer available. The miniconda package has been removed from Databricks Runtime 13.0 ML and above.\nSolution\nSwitch your environment manager for MLflow to virtualenv in your workflow or notebook. Update your\nenv_manager\nvariable from\n“conda”\nto\n“virtualenv”\n.\nExample\nmlflow.pyfunc.spark_udf(\r\n    model_uri=model_uri,\r\n    result_type=\"double\",\r\n    env_manager=\"virtualenv\"\r\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/expensive-transformation-on-dataframe-is-recalculated-even-when-cached.json b/scraped_kb_articles/expensive-transformation-on-dataframe-is-recalculated-even-when-cached.json new file mode 100644 index 0000000000000000000000000000000000000000..f49c65a220fd57c6153a2c3717ff3789601b8733 --- /dev/null +++ b/scraped_kb_articles/expensive-transformation-on-dataframe-is-recalculated-even-when-cached.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/expensive-transformation-on-dataframe-is-recalculated-even-when-cached", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nApache Spark DataFrame caching is used to prevent redundant computations during a complex ETL process. When you attempt to reuse a cached DataFrame, the actions trigger a full recomputation of the transformation instead of using the cached information.\nCause\nThere are several reasons why a cached DataFrame can be invalidated, leading to unexpected recalculations.\nWhen you call\ndf.cache()\nthe DataFrame is not immediately cached. The caching occurs when an action like\ndf.count()\n,\ndf.show()\n, etc. is executed. 
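The lazy behavior of df.cache() described here can be sketched with a plain-Python analogy. This is not the Spark API — LazyCache, cache, and count below are hypothetical stand-ins that only mimic the mark-then-materialize pattern:

```python
class LazyCache:
    """Hypothetical stand-in mimicking Spark's lazy DataFrame caching."""

    def __init__(self, compute):
        self._compute = compute        # deferred, potentially expensive computation
        self._value = None
        self._materialized = False

    def cache(self):
        # Like df.cache(): only marks the data for caching, computes nothing yet.
        return self

    def count(self):
        # Like an action (df.count()): the first call materializes the cache.
        if not self._materialized:
            self._value = self._compute()
            self._materialized = True
        return len(self._value)

calls = []

def expensive():
    calls.append(1)                    # track how many times we actually compute
    return [10, 20, 30]

df = LazyCache(expensive).cache()
assert calls == []                     # cache() alone computed nothing
df.count()                             # first action triggers the computation
df.count()                             # second action is served from the cache
assert calls == [1]                    # the expensive computation ran exactly once
```

The same principle applies in Spark: until an action runs, cache() has only annotated the plan, so nothing is stored.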
If an action is not triggered after calling\ndf.cache()\n, the DataFrame is not cached.\nWhen a DataFrame is cached on cluster A and the underlying data source is updated (\nINSERT\n/\nUPDATE\n/\nMERGE\n/\nDELETE\n) using the same cluster, the cache is invalidated and recomputed. If the underlying data source is updated using another cluster (cluster B), the cache on the original cluster (cluster A) is not affected. In this scenario, the cached DataFrame continues to be used, but it reflects the state of the data as it was before the external update.\nBy default, Spark caches data using the\nMEMORY_AND_DISK\nstorage level. In this mode, if there is insufficient memory for live task computations, Spark automatically moves the cached data from memory to disk when the memory is full. But if the storage level is set to\nMEMORY_ONLY\n, cached data can be evicted when the memory is full. In this case, a lazy recalculation of the cached data occurs on the next access.\nPerforming any non-idempotent transformations on a cached DataFrame may lead to recomputation. This is because the logical plan of the DataFrame currently being executed may not match the cached logical plan, leading to a cache miss that triggers recomputation without using the cached value.\nInfo\nOn Databricks Runtime 11.3 LTS and below, DataFrames cached against Delta table sources are not invalidated when the source table is overwritten, so an older snapshot of the data continues to be used. This behavior remains the same even when the same cluster is used for the overwrite. In this case, users must manually re-cache.\nOn Databricks Runtime 12.2 LTS and above, DataFrames cached against Delta table sources can be invalidated when the source table is overwritten. 
They are invalidated when the source is overwritten using the same cluster and are not invalidated when the source is overwritten on a different cluster.\nReview the\nDatabricks Runtime maintenance updates\n(\nAzure\n|\nAWS\n|\nGCP\n) for more information.\nSolution\nIf your data and transformations are complex enough to cause unwanted re-caching, you should consider using a different Spark persistence solution.\nMaterialize intermediate complex results into temporary Delta tables and drop them after your ETL is finished. This allows Spark to run the complex transformation once and write the results into a Delta table, which acts as temporary immutable persistent storage. Read the Delta table to get the calculated results.\nUse a\nSpark DataFrame checkpoint\nto write the DataFrame into a reliable HDFS storage location where it can persist without being affected by underlying changes." +} \ No newline at end of file diff --git a/scraped_kb_articles/experiencing-an-exception-indicating-yaml-file-exists-as-referring-to-metayaml-file-when-creating-an-mlflow-experiment.json b/scraped_kb_articles/experiencing-an-exception-indicating-yaml-file-exists-as-referring-to-metayaml-file-when-creating-an-mlflow-experiment.json new file mode 100644 index 0000000000000000000000000000000000000000..8986de7b5c5b35bf0bc30858574d2d448e38384d --- /dev/null +++ b/scraped_kb_articles/experiencing-an-exception-indicating-yaml-file-exists-as-referring-to-metayaml-file-when-creating-an-mlflow-experiment.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/experiencing-an-exception-indicating-yaml-file-exists-as-referring-to-metayaml-file-when-creating-an-mlflow-experiment", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen creating an MLflow experiment for feature engineering, you encounter an issue with the\nmeta.yaml\nfile. 
You receive an error message.\nException: Yaml file '/Workspace/Users/.bundle//feature_engineering/features/mlruns/0/meta.yaml' exists as '/Workspace/Users/.bundle//feature_engineering/features/mlruns/0/meta.yaml'\nCause\nWhen MLflow is attempting to create a new experiment, it expects to write a fresh\nmeta.yaml\nfile. If the file already exists, especially if it contains data from a previous, incomplete, or corrupted run, MLflow raises an exception to prevent accidental overwriting.\nSolution\n1. Use the following configuration in your code to set a tracking URI. This command properly directs MLflow to the Databricks-hosted tracking server and ensures conflicts within the local file structure are corrected.\nmlflow.set_tracking_uri(\"databricks\")\n2. Ensure you have set the\nDATABRICKS_HOST\nand\nDATABRICKS_TOKEN\nenvironment variables.\nFor more information, please refer to the “Where MLflow runs are logged” section of the\nTrack model development using MLflow\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAlternatively, remove the corrupted\nmeta.yaml\nfile using the following command in a notebook.\n%sh rm -rf \"/Workspace/Users/.bundle//feature_engineering/features/mlruns/0/meta.yaml\"" +} \ No newline at end of file diff --git a/scraped_kb_articles/explicit-path-to-data-or-a-defined-schema-required-for-auto-loader.json b/scraped_kb_articles/explicit-path-to-data-or-a-defined-schema-required-for-auto-loader.json new file mode 100644 index 0000000000000000000000000000000000000000..cb129d801bcd0fbde3c4660725c251f9e1946514 --- /dev/null +++ b/scraped_kb_articles/explicit-path-to-data-or-a-defined-schema-required-for-auto-loader.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/explicit-path-to-data-or-a-defined-schema-required-for-auto-loader", + "title": "Título do Artigo Desconhecido", + "content": "Info\nThis article applies to Databricks Runtime 9.1 LTS and above.\nProblem\nYou are using Auto Loader to ingest data for your ELT pipeline 
when you get an\nIllegalArgumentException: Please provide the source directory path with option `path`\nerror message.\nYou get this error when you start an Auto Loader job, if either the path to the data or the data schema is not defined.\nError:\r\nIllegalArgumentException                 Traceback (most recent call last)\r\n in \r\n     1 df = (\r\n----> 2    spark\r\n     3    .readStream.format(\"cloudFiles\")\r\n     4    .options(**{\r\n     5        \"cloudFiles.format\": \"csv\",\r\n/databricks/spark/python/pyspark/sql/streaming.py in load(self, path, format, schema, **options)\r\n   480            return self._df(self._jreader.load(path))\r\n   481        else:\r\n--> 482            return self._df(self._jreader.load())\r\n   483\r\n   484    def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,\r\n/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)\r\n  1302\r\n  1303        answer = self.gateway_client.send_command(command)\r\n-> 1304        return_value = get_return_value(\r\n  1305            answer, self.gateway_client, self.target_id, self.name)\r\n  1306\r\n/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)\r\n   121                # Hide where the exception came from that shows a non-Pythonic\r\n   122                # JVM exception message.\r\n--> 123                raise converted from None\r\n   124            else:\r\n   125                raise\r\nIllegalArgumentException: Please provide the source directory path with option `path`\nCause\nAuto Loader requires you to provide the path to your data location, or for you to define the schema. If you provide a path to the data, Auto Loader attempts to infer the data schema. 
If you do not provide the path, Auto Loader cannot infer the schema and requires you to explicitly define the data schema.\nFor example, if a value for\n\nis not included in this sample code, the error is generated when you start your Auto Loader job.\n%python\r\n\r\ndf = spark.readStream.format(\"cloudFiles\") \\\r\n.option(, ) \\\r\n.load()\nIf a value for\n\nis included in this sample code, the Auto Loader job can infer the schema when it starts and will not generate the error.\n%python\r\n\r\ndf = spark.readStream.format(\"cloudFiles\") \\\r\n.option(, ) \\\r\n.load()\nSolution\nYou have to provide either the path to your data or the data schema when using Auto Loader.\nIf you do not specify the path, then the data schema MUST be defined.\nFor example, this sample code has the data schema defined, but no path specified. Because the data schema was defined, the path is optional. This does not generate an error when the Auto Loader job is started.\n%python\r\n\r\ndf = spark.readStream.format(\"cloudFiles\") \\\r\n.option(, ) \\\r\n.schema() \\\r\n.load()" +} \ No newline at end of file diff --git a/scraped_kb_articles/explore-spark-metrics.json b/scraped_kb_articles/explore-spark-metrics.json new file mode 100644 index 0000000000000000000000000000000000000000..60b18058bbffcc332f61d13da89598cb21b332d2 --- /dev/null +++ b/scraped_kb_articles/explore-spark-metrics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metrics/explore-spark-metrics", + "title": "Título do Artigo Desconhecido", + "content": "Apache Spark provides several useful internal listeners that track metrics about tasks and jobs. During the development cycle, for example, these metrics can help you to understand when and why a task takes a long time to finish. Of course, you can leverage the Spark UI or History UI to see information for each task and stage, but there are some downsides. 
For instance, you can’t compare the statistics for two Spark jobs side by side, and the Spark History UI can take a long time to load for large Spark jobs.\nYou can extract the metrics generated by Spark internal classes and persist them to disk as a table or a\nDataFrame\n. Then you can query the\nDataFrame\njust like any other data science table.\nYou can use this\nSparkTaskMetrics package\nto explore how to use Spark listeners to extract metrics from tasks and jobs.\nBuild the Spark Metrics package\nUse the following command to build the package.\n%sh\r\n\r\nsbt package\nGather metrics\nImport\nTaskMetricsExplorer\n. Create the query\nsql(\"\"\"SELECT * FROM nested_data\"\"\").show(false)\nand pass it into\nrunAndMeasure\n. The query should include at least one Spark action in order to trigger a Spark job. Spark does not generate any metrics until a Spark job is executed.\nFor example:\n%scala\r\n\r\nimport com.databricks.TaskMetricsExplorer\r\n\r\nval t = new TaskMetricsExplorer(spark)\r\nsql(\"\"\"CREATE OR REPLACE TEMPORARY VIEW nested_data AS\r\n       SELECT id AS key,\r\n       ARRAY(CAST(RAND(1) * 100 AS INT), CAST(RAND(2) * 100 AS INT), CAST(RAND(3) * 100 AS INT), CAST(RAND(4) * 100 AS INT), CAST(RAND(5) * 100 AS INT)) AS values,\r\n       ARRAY(ARRAY(CAST(RAND(1) * 100 AS INT), CAST(RAND(2) * 100 AS INT)), ARRAY(CAST(RAND(3) * 100 AS INT), CAST(RAND(4) * 100 AS INT), CAST(RAND(5) * 100 AS INT))) AS nested_values\r\n       FROM range(5)\"\"\")\r\nval query = sql(\"\"\"SELECT * FROM nested_data\"\"\").show(false)\r\nval res = t.runAndMeasure(query)\nThe\nrunAndMeasure\nmethod runs the command and gets the task’s internal metrics using a Spark listener. 
It then runs the query and returns the result:\n+---+-------------------+-----------------------+\r\n|key|values             |nested_values          |\r\n+---+-------------------+-----------------------+\r\n|0  |[26, 11, 66, 8, 47]|[[26, 11], [66, 8, 47]]|\r\n|1  |[66, 8, 47, 91, 6] |[[66, 8], [47, 91, 6]] |\r\n|2  |[8, 47, 91, 6, 70] |[[8, 47], [91, 6, 70]] |\r\n|3  |[91, 6, 70, 41, 19]|[[91, 6], [70, 41, 19]]|\r\n|4  |[6, 70, 41, 19, 12]|[[6, 70], [41, 19, 12]]|\r\n+---+-------------------+-----------------------+\nThe task metrics information is saved in a\nDataFrame\n. You can display it with this command:\n%scala\r\n\r\nres.select($\"stageId\", $\"taskType\", $\"taskLocality\", $\"executorRunTime\", $\"duration\", $\"executorId\", $\"host\", $\"jvmGCTime\").show(false)\nThen you get:\n+-------+----------+-------------+---------------+--------+----------+---------+---------+\r\n|stageId|taskType  |taskLocality |executorRunTime|duration|executorId| host    |jvmGCTime|\r\n+-------+----------+-------------+---------------+--------+----------+---------+---------+\r\n|3      |ResultTask|PROCESS_LOCAL|2              |9       |driver    |localhost|0        |\r\n|4      |ResultTask|PROCESS_LOCAL|3              |11      |driver    |localhost|0        |\r\n|4      |ResultTask|PROCESS_LOCAL|3              |16      |driver    |localhost|0        |\r\n|4      |ResultTask|PROCESS_LOCAL|2              |20      |driver    |localhost|0        |\r\n|4      |ResultTask|PROCESS_LOCAL|4              |22      |driver    |localhost|0        |\r\n|5      |ResultTask|PROCESS_LOCAL|2              |12      |driver    |localhost|0        |\r\n|5      |ResultTask|PROCESS_LOCAL|3              |17      |driver    |localhost|0        |\r\n|5      |ResultTask|PROCESS_LOCAL|7              |21      |driver    |localhost|0        |\r\n+-------+----------+-------------+---------------+--------+----------+---------+---------+\nTo view all available metrics names and data types, display the schema 
of the\nres DataFrame\n:\n%scala\r\n\r\nres.schema.treeString\nroot\r\n |-- stageId: integer (nullable = false)\r\n |-- stageAttemptId: integer (nullable = false)\r\n |-- taskType: string (nullable = true)\r\n |-- index: long (nullable = false)\r\n |-- taskId: long (nullable = false)\r\n |-- attemptNumber: integer (nullable = false)\r\n |-- launchTime: long (nullable = false)\r\n |-- finishTime: long (nullable = false)\r\n |-- duration: long (nullable = false)\r\n |-- schedulerDelay: long (nullable = false)\r\n |-- executorId: string (nullable = true)\r\n |-- host: string (nullable = true)\r\n |-- taskLocality: string (nullable = true)\r\n |-- speculative: boolean (nullable = false)\r\n |-- gettingResultTime: long (nullable = false)\r\n |-- successful: boolean (nullable = false)\r\n |-- executorRunTime: long (nullable = false)\r\n |-- executorCpuTime: long (nullable = false)\r\n |-- executorDeserializeTime: long (nullable = false)\r\n |-- executorDeserializeCpuTime: long (nullable = false)\r\n |-- resultSerializationTime: long (nullable = false)\r\n |-- jvmGCTime: long (nullable = false)\r\n |-- resultSize: long (nullable = false)\r\n |-- numUpdatedBlockStatuses: integer (nullable = false)\r\n |-- diskBytesSpilled: long (nullable = false)\r\n |-- memoryBytesSpilled: long (nullable = false)\r\n |-- peakExecutionMemory: long (nullable = false)\r\n |-- recordsRead: long (nullable = false)\r\n |-- bytesRead: long (nullable = false)\r\n |-- recordsWritten: long (nullable = false)\r\n |-- bytesWritten: long (nullable = false)\r\n |-- shuffleFetchWaitTime: long (nullable = false)\r\n |-- shuffleTotalBytesRead: long (nullable = false)\r\n |-- shuffleTotalBlocksFetched: long (nullable = false)\r\n |-- shuffleLocalBlocksFetched: long (nullable = false)\r\n |-- shuffleRemoteBlocksFetched: long (nullable = false)\r\n |-- shuffleWriteTime: long (nullable = false)\r\n |-- shuffleBytesWritten: long (nullable = false)\r\n |-- shuffleRecordsWritten: long (nullable = false)\r\n |-- 
errorMessage: string (nullable = true)" +} \ No newline at end of file diff --git a/scraped_kb_articles/extract-feature-info.json b/scraped_kb_articles/extract-feature-info.json new file mode 100644 index 0000000000000000000000000000000000000000..8226499eab186d2eb723f1fe97ba0010e64025bb --- /dev/null +++ b/scraped_kb_articles/extract-feature-info.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/extract-feature-info", + "title": "Título do Artigo Desconhecido", + "content": "When you are fitting a tree-based model, such as a decision tree, random forest, or gradient boosted tree, it is helpful to be able to review the feature importance levels along with the feature names. Typically models in SparkML are fit as the last stage of the pipeline. To extract the relevant feature information from the pipeline with the tree model, you must extract the correct pipeline stage. You can extract the feature names from the\nVectorAssembler\nobject:\n%python\r\n\r\nfrom pyspark.ml.feature import StringIndexer, VectorAssembler\r\nfrom pyspark.ml.classification import DecisionTreeClassifier\r\nfrom pyspark.ml import Pipeline\r\n\r\npipeline = Pipeline(stages=[indexer, assembler, decision_tree])\r\nDTmodel = pipeline.fit(train)\r\nva = DTmodel.stages[-2]\r\ntree = DTmodel.stages[-1]\r\n\r\ndisplay(tree) #visualize the decision tree model\r\nprint(tree.toDebugString) #print the nodes of the decision tree model\r\n\r\nlist(zip(va.getInputCols(), tree.featureImportances))\nYou can also tune a tree-based model using a cross validator in the last stage of the pipeline. 
To visualize the decision tree and print the feature importance levels, you extract the\nbestModel\nfrom the\nCrossValidator\nobject:\n%python\r\n\r\nfrom pyspark.ml.tuning import ParamGridBuilder, CrossValidator\r\n\r\ncv = CrossValidator(estimator=decision_tree, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)\r\npipelineCV = Pipeline(stages=[indexer, assembler, cv])\r\nDTmodelCV = pipelineCV.fit(train)\r\nva = DTmodelCV.stages[-2]\r\ntreeCV = DTmodelCV.stages[-1].bestModel\r\n\r\ndisplay(treeCV) #visualize the best decision tree model\r\nprint(treeCV.toDebugString) #print the nodes of the decision tree model\r\n\r\nlist(zip(va.getInputCols(), treeCV.featureImportances))\nThe display function visualizes decision tree models only. See Machine learning visualizations (\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/extract-timestamps-with-precision-up-to-nano-seconds-from-a-long-column.json b/scraped_kb_articles/extract-timestamps-with-precision-up-to-nano-seconds-from-a-long-column.json new file mode 100644 index 0000000000000000000000000000000000000000..dbfe41ae9edae5bc4b32d187f4def398ecbf8764 --- /dev/null +++ b/scraped_kb_articles/extract-timestamps-with-precision-up-to-nano-seconds-from-a-long-column.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/extract-timestamps-with-precision-up-to-nano-seconds-from-a-long-column", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are working with Apache Spark timestamps when you encounter an issue where the nanosecond precision from\ndatetime\ncolumns is not retained. 
When converting a\nLongType\ncolumn to\nTimestampType\n, the nanosecond precision is lost, leading to inaccurate timestamps.\nCause\nSpark does not natively support\nTimestampType\nwith nanosecond precision; it currently supports only microsecond precision. This means that directly casting a\nLongType\ncolumn containing nanoseconds to\nTimestampType\nresults in truncation.\nSolution\nTo retain nanosecond precision, you must create a User-Defined Function (UDF) that processes the\nLongType\ncolumn and converts it into a properly formatted string representation of the timestamp, keeping the nanosecond component intact.\nExample code (Scala)\nImport the required libraries:\nimport java.time.{Instant, ZoneId}\r\nimport java.time.format.DateTimeFormatter\r\nimport org.apache.spark.sql.functions._\r\nimport org.apache.spark.sql.expressions.UserDefinedFunction\nDefine the UDF to format the timestamp with nanosecond precision.\nval formatInstantWithNanos: UserDefinedFunction = udf((seconds: Long, nanos: Long) => {\r\n  val instant = Instant.ofEpochSecond(seconds, nanos)\r\n  val formatter = DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss.SSSSSSSSS\")\r\n    .withZone(ZoneId.of(\"\"))\r\n  formatter.format(instant)\r\n})\nApply the UDF to the DataFrame.\nval convertedDF = source_df.withColumn(\"\", \r\n  formatInstantWithNanos(\r\n    col(\"\") / 1000000000L,  // Seconds part\r\n    col(\"\") % 1000000000L   // Nanoseconds part\r\n  )\r\n)\nHere, the\nsource_df\nrepresents the original DataFrame that reads the raw data. The\n\nis the column that has the\nLongType\ndata from which we want to extract the timestamps with precision up to nanoseconds. 
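The seconds/nanos split performed by the UDF above can be checked in plain Python (stdlib only; format_epoch_nanos and the epoch value below are illustrative, not part of the article's Scala code):

```python
from datetime import datetime, timezone

def format_epoch_nanos(total_nanos: int) -> str:
    # Split the raw LongType value into whole seconds and leftover nanoseconds,
    # mirroring the UDF's / 1000000000L and % 1000000000L arithmetic.
    seconds, nanos = divmod(total_nanos, 1_000_000_000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Append the full 9-digit nanosecond field that TimestampType would truncate.
    return f"{base.strftime('%Y-%m-%d %H:%M:%S')}.{nanos:09d}"

print(format_epoch_nanos(1_500_019_800_123_456_789))
# → 2017-07-14 08:10:00.123456789
```

The 9-digit fractional field is what survives as a string; a microsecond-precision timestamp type would keep only the first 6 digits.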
Then, we create a new\nconvertedDF\nDataFrame with the\n\ncolumn, which has the timestamp with nanosecond precision in a\nStringType\nformat.\nFurther downstream operations on the\n\ncolumn should be done based on\nStringType\noperations.\nUse this UDF in Spark when dealing with timestamps stored in\nLongType\nto ensure that the nanosecond component is preserved.\nInstead of converting directly to\nTimestampType\n, store the formatted timestamp as a string and modify downstream applications to work with this format.\nIf further processing is needed, downstream applications can parse the string to retrieve the nanosecond-precision timestamp.\nAvoid casting to TimestampType\nIf we convert or cast the\nStringType\ncolumn to a\nTimestampType\ncolumn, the nanosecond precision is lost, as Spark currently doesn’t support precision up to nanoseconds.\nIf we create a new column\n\nand cast the\n\ncolumn to a\nTimestampType\n, the nanoseconds are lost.\nExample code casting to TimestampType\nimport org.apache.spark.sql.functions._\r\nimport org.apache.spark.sql.types._\r\nvar dfWithTimestamp = convertedDF.withColumn(\"timestamp_col_casted\", col(\"formatted_datetime\").cast(TimestampType))\r\ndfWithTimestamp = dfWithTimestamp.select(\"formatted_datetime\",\"timestamp_col_casted\")\r\ndfWithTimestamp.show(false)\nExample result casting to TimestampType\nThe results show that the nanosecond precision is lost in the casted column.\n+----------------------------+----------------------------+\r\n|formatted_datetime          |timestamp_col_casted        |\r\n+----------------------------+----------------------------+\r\n|2017-07-14 08:10:00.123456789|2017-07-14 08:10:00.123456 |\r\n+----------------------------+----------------------------+\r\n\r\ndfWithTimestamp: org.apache.spark.sql.DataFrame = [formatted_datetime: string, timestamp_col_casted: timestamp]" +} \ No 
newline at end of file diff --git a/scraped_kb_articles/fail-create-cluster-tag-limit.json b/scraped_kb_articles/fail-create-cluster-tag-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..6774e1e69453b3a446efb161dfe345e326720831 --- /dev/null +++ b/scraped_kb_articles/fail-create-cluster-tag-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/fail-create-cluster-tag-limit", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to create a cluster, but it is failing with an invalid tag value error message.\nSystem.Exception: Content={\"error_code\":\"INVALID_PARAMETER_VALUE\",\"message\":\"\\nInvalid tag value (<<<>>>) - the length cannot exceed 256\\nUnicode characters in UTF-8.\\n \"}\nCause\nLimitations on tag keys and values are set by the cloud provider.\nAWS\nAWS tag keys must:\nContain 1-127 characters\nContain letters, spaces, numbers, or the characters\n+ - = . _ : / @\nNot start with\naws:\nNot duplicate an existing key\nAWS tag values must:\nContain 1-255 characters\nContain letters, spaces, numbers, or the characters\n+ - = . _ : / @\nNot start with\naws:\nFor more information, please refer to the AWS\ntag naming limits and requirements\ndocumentation.\nAzure\nAzure tag keys must:\nContain 1-512 characters\nContain letters, numbers, spaces (except\n< > * % & : \\ ? / +\n)\nNot start with\nazure\n,\nmicrosoft\n, or\nwindows\nNot duplicate an existing key\nAzure tag values must:\nContain 1-256 characters\nContain letters, numbers, spaces (except\n< > * % & : \\ ? 
/ +\n)\nNot start with\nazure\n,\nmicrosoft\n, or\nwindows\nFor more information, please refer to the Azure\ntag resource limitations\ndocumentation.\nGCP\nGoogle Cloud tag keys must:\nContain 1-63 characters\nContain letters, numbers, or the characters\n- _ .\nNot duplicate an existing key\nGoogle Cloud tag values must:\nContain 1-63 characters\nContain letters, numbers, or the characters\n- _ .\nFor more information, please refer to the Google Cloud\nrequirements for labels\ndocumentation.\nSolution\nDatabricks cannot modify these limits.\nRequests to update any limits on tagging must be made directly with the cloud provider support team." +} \ No newline at end of file diff --git a/scraped_kb_articles/failed-to-add-user-error-due-to-email-or-username-already-existing-with-a-different-case.json b/scraped_kb_articles/failed-to-add-user-error-due-to-email-or-username-already-existing-with-a-different-case.json new file mode 100644 index 0000000000000000000000000000000000000000..b54580032d4964619d25f755f366cae49da5bf66 --- /dev/null +++ b/scraped_kb_articles/failed-to-add-user-error-due-to-email-or-username-already-existing-with-a-different-case.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/failed-to-add-user-error-due-to-email-or-username-already-existing-with-a-different-case", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/failed-to-create-query-error-when-upgrading-external-metastore-to-unity-catalog.json b/scraped_kb_articles/failed-to-create-query-error-when-upgrading-external-metastore-to-unity-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..c6248c456134105ca542bc1ca72d915fb99d6ec5 --- /dev/null +++ b/scraped_kb_articles/failed-to-create-query-error-when-upgrading-external-metastore-to-unity-catalog.json @@ -0,0 +1,5 @@ +{ + "url": 
"https://kb.databricks.com/en_US/metastore/failed-to-create-query-error-when-upgrading-external-metastore-to-unity-catalog", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to migrate your default Hive metastore to Unity Catalog by following the steps in the\nUpgrade a schema or multiple tables to Unity Catalog\n(\nAWS\n|\nAzure\n) documentation.\nYou have created the necessary\nstorage credential\n(\nAWS\n|\nAzure\n) as well as the\nexternal location\n(\nAWS\n|\nAzure\n). Permissions are correct. You click\nCreate Query for Upgrade\nand try to run the generated query, but it generates a\nFailed to create query\nerror.\nFailed to create query try again {\"message\": \"Internal Server Error\"}\nCause\nCreate query for upgrade\nonly works when running the command on a warehouse. If you select a cluster instead of a warehouse in\nData Explorer\n, the upgrade query fails to run.\nSolution\nEnsure you select a warehouse from the drop down menu in the upper right corner of the\nData Explorer\npage before you try to run an upgrade query." +} \ No newline at end of file diff --git a/scraped_kb_articles/failed-to-create-space-message-when-attempting-to-create-new-genie-space.json b/scraped_kb_articles/failed-to-create-space-message-when-attempting-to-create-new-genie-space.json new file mode 100644 index 0000000000000000000000000000000000000000..caece223c181f4bc3d280a7d9b92d46c40980c63 --- /dev/null +++ b/scraped_kb_articles/failed-to-create-space-message-when-attempting-to-create-new-genie-space.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/failed-to-create-space-message-when-attempting-to-create-new-genie-space", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you navigate to\nGenie\nfrom your workspace dashboard view, you only see a blank screen. 
When you try to click the\n+New\nbutton, you receive the following error message.\n\"error_code\": \"PERMISSION_DENIED\",\r\n  \"message\": \"Partner-powered AI features are disabled. Please contact your administrator.\",\r\n  \"details\": [{
}]\nThe following image shows an additional message in the UI, “failed to create space,” along with the error message.\nCause\nYou have the setting\nEnforce data processing within workspace Geography for Designated Services\nenabled for the workspace in your account console.\nSolution\nNavigate to Account console > Workspace >\nSecurity and compliance\n.\nToggle off\nEnforce data processing within workspace Geography for Designated Services\nto disable the setting. The following image shows the setting within the workspace UI.\nFor more information, review the\nCurate an effective Genie space\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/failed-to-install-elasticsearch-via-maven.json b/scraped_kb_articles/failed-to-install-elasticsearch-via-maven.json new file mode 100644 index 0000000000000000000000000000000000000000..66b07e34362fe4ede6da8d6738f92e08083001a4 --- /dev/null +++ b/scraped_kb_articles/failed-to-install-elasticsearch-via-maven.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/failed-to-install-elasticsearch-via-maven", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to install Elasticsearch via Maven when you get a\nDRIVER_LIBRARY_INSTALLATION_FAILURE\nerror message saying that the library resolution failed.\nError Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Library resolution failed. 
Cause: java.lang.RuntimeException: org.slf4j:slf4j-api download failed.\nCause\nThe Elasticsearch library is trying to install dependencies that are already installed, resulting in a conflict.\nSolution\nThis can be resolved by excluding the dependencies before starting the install.\nSelect\nCompute\nfrom the left side menu.\nClick on the name of the cluster you want to modify.\nClick\nLibraries\n.\nClick\nInstall new\n.\nClick\nMaven\n.\nEnter\norg.elasticsearch:elasticsearch-spark-30_2.12:7.17.6\nin the\nCoordinates\ntext box.\nEnter\ncommons-logging:commons-logging,org.slf4j:slf4j-api,com.google.protobuf:protobuf-java,javax.xml.bind:jaxb-api\nin the\nExclusions\ntext box.\nClick\nInstall\n.\nYou should now be able to start your cluster and successfully complete the Elasticsearch install." +} \ No newline at end of file diff --git a/scraped_kb_articles/failing-api-calls-in-mlflow-because-of-float64-column-values-.json b/scraped_kb_articles/failing-api-calls-in-mlflow-because-of-float64-column-values-.json new file mode 100644 index 0000000000000000000000000000000000000000..bef7ffb3607a078c15ccdf39c04757d12151c4c1 --- /dev/null +++ b/scraped_kb_articles/failing-api-calls-in-mlflow-because-of-float64-column-values-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/failing-api-calls-in-mlflow-because-of-float64-column-values-", + "title": "Unknown Article Title", + "content": "Problem\nWhen calling the Databricks Machine Learning API to score a model endpoint, you notice the API calls failing as Float64 values in a column are not recognized.\nException: Request failed with status 400, {\"error_code\": \"BAD_REQUEST\", \"message\": \"Incompatible input types for column {your-column-name}.
Can not safely convert float64 to float32.\"}.\nThis issue is not specific to any particular cluster, workflow, or notebook, and persists across different clusters with ML runtime and various workflows and notebooks.\nCause\nWhen the input data schema is not explicitly logged, the schema inferred from the JSON (created by\njson_dumps\n) does not match the expected schema. This causes the failing API call, particularly when converting Float64 to Float32.\nSolution\nExplicitly log the input schema when creating the model.\nManually construct the signature object for the input schema using the\nmlflow.models.ModelSignature\nand\nmlflow.types.schema.Schema\nclasses.\nLog the model with the explicitly defined input schema.\nExample\nfrom mlflow.models import ModelSignature, infer_signature\r\nfrom mlflow.types.schema import Schema, ColSpec\r\n\r\n# Define the input schema\r\ninput_schema = Schema([\r\n    ColSpec(\"double\", \"sepal length (cm)\"),\r\n    ColSpec(\"double\", \"sepal width (cm)\"),\r\n    ColSpec(\"double\", \"petal length (cm)\"),\r\n    ColSpec(\"double\", \"petal width (cm)\")\r\n])\r\n\r\n# Define the output schema\r\noutput_schema = Schema([ColSpec(\"long\")])\r\n\r\n# Create the signature object\r\nsignature = ModelSignature(inputs=input_schema, outputs=output_schema)\r\n\r\n# Log the model with the schema\r\nmlflow.pyfunc.log_model(artifact_path=\"model\", python_model=, signature=signature)\nFor more information, please refer to the\nMLflow Model Signatures and Input Examples Guide\n." 
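The schema fix above works on the logging side; on the client side it also helps to keep the JSON request consistent with the logged `double` columns. A minimal sketch of a scoring-payload builder, assuming the MLflow `dataframe_split` request format; the column names simply mirror the signature example above and are illustrative:

```python
import json

# Hypothetical helper: builds a scoring request body in MLflow's
# "dataframe_split" input format. Python floats serialized by json.dumps
# are 64-bit, which lines up with the "double" ColSpecs in the signature.
def build_scoring_payload(columns, rows):
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

payload = build_scoring_payload(
    ["sepal length (cm)", "sepal width (cm)",
     "petal length (cm)", "petal width (cm)"],
    [[5.1, 3.5, 1.4, 0.2]],
)
```

The payload string would then be posted to the endpoint with your usual HTTP client; with the signature logged explicitly, the server no longer has to guess float32 vs. float64 from the JSON.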
+} \ No newline at end of file diff --git a/scraped_kb_articles/failure-to-initialize-configuration-for-storage-account-while-registering-a-model-serving-model-in-unity-catalog.json b/scraped_kb_articles/failure-to-initialize-configuration-for-storage-account-while-registering-a-model-serving-model-in-unity-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..6fd387b0eaba57b835d0d63ae7d57cd76a303df2 --- /dev/null +++ b/scraped_kb_articles/failure-to-initialize-configuration-for-storage-account-while-registering-a-model-serving-model-in-unity-catalog.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/failure-to-initialize-configuration-for-storage-account-while-registering-a-model-serving-model-in-unity-catalog", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to register a serving model endpoint to the Unity Catalog metastore when you get a\nFailure to initialize configuration for storage account\nerror message.\nExample error message\nExecutionError: An error occurred while calling o437.ls.\r\n: Failure to initialize configuration for storage account XXXXXdbrmetastoreeus2.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key\r\n\r\nMlflowException: The following failures occurred while uploading one or more artifacts to abfss://metastore@XXXXXXXXXXX.dfs.core.windows.net/5f2ca358-314a-42da-9a9b-2eca16415dd0/models/e40b317e-9f99-4d86-8384-4a4daa7845d7/versions/b1a714a7-3704-43d9-bbbd-cb81262ef98d: {'/local_disk0/repl_tmp_data/ReplId-76b2b-ace2e-48b5d-6/tmpb8d4j1_k/model/model.pkl': \"HttpResponseError('(AuthorizationFailure) This request is not authorized to perform this operation.\\\\nRequestId:e882052a-301f-0057-45ee-fa7936000000\\\\nTime:2024-08-30T15:09:54.6446260Z\\\\nCode: AuthorizationFailure\\\\nMessage: This request is not authorized to perform this
operation.\\\\nRequestId:e882052a-301f-0057-45ee-fa7936000000\\\\nTime:2024-08-30T15:09:54.6446260Z')\", '/local_disk0/repl_tmp_data/ReplId-76b2b-ace2e-48b5d-6/tmpb8d4j1_k/model/requirements.txt': \"HttpResponseError('(AuthorizationFailure) This request is not authorized to perform this operation.\\\\nRequestId:0328be4e-d01f-0019-65ee-fabcd3000000\\\\nTime:2024-08-30T15:09:54.6494435Z\\\\nCode: AuthorizationFailure\\\\nMessage: This request is not authorized to perform this operation.\\\\nRequestId:0328be4e-d01f-0019-65ee-fabcd3000000\\\\nTime:2024-08-30T15:09:54.6494435Z')\", '/local_disk0/repl_tmp_data/ReplId-76b2b-ace2e-48b5d-6/tmpb8d4j1_k/model/python_env.yaml': \"HttpResponseError('(AuthorizationFailure) This request is not authorized to perform this operation.\\\\nRequestId:f38fc42e-501f-0008-43ee-fa8bc8000000\\\\nTime:2024-08-30T15:09:54.6518467Z\\\\nCode: AuthorizationFailure\\\\nMessage: This request is not authorized to perform this operation.\\\\nRequestId:f38fc42e-501f-0008-43ee-fa8bc8000000\\\\nTime:2024-08-30T15:09:54.6518467Z')\", '/local_disk0/repl_tmp_data/ReplId-76b2b-ace2e-48b5d-6/tmpb8d4j1_k/model/MLmodel': \"HttpResponseError('(AuthorizationFailure) This request is not authorized to perform this operation.\\\\nRequestId:bad77956-201f-0007-75ee-fa663e000000\\\\nTime:2024-08-30T15:09:54.6167981Z\\\\nCode: AuthorizationFailure\\\\nMessage: This request is not authorized to perform this operation.\\\\nRequestId:bad77956-201f-0007-75ee-fa663e000000\\\\nTime:2024-08-30T15:09:54.6167981Z')\", '/local_disk0/repl_tmp_data/ReplId-76b2b-ace2e-48b5d-6/tmpb8d4j1_k/model/conda.yaml': \"HttpResponseError('(AuthorizationFailure) This request is not authorized to perform this operation.\\\\nRequestId:e5ee5a65-101f-0069-1aee-facf17000000\\\\nTime:2024-08-30T15:09:54.5864086Z\\\\nCode: AuthorizationFailure\\\\nMessage: This request is not authorized to perform this 
operation.\\\\nRequestId:e5ee5a65-101f-0069-1aee-facf17000000\\\\nTime:2024-08-30T15:09:54.5864086Z')\"}\nCause\nThe issue can occur when an external location is not provisioned during the initial Unity Catalog configuration.\nTo register a serving model with Unity Catalog, the serving model needs to have an external\nstorage_root\ndefined on the Unity Catalog metastore.\nThe root metastore may have been blocked by a firewall, resulting in a failure when uploading model serving artifacts.\nExample\nYou can validate your storage root location with this sample code to see if you have properly configured an external location. Replace\n\nand\n\nwith your own catalog and schema when running the sample code.\n%sql\r\n\r\nDESCRIBE CATALOG EXTENDED ;\r\nDESCRIBE SCHEMA EXTENDED ;\nIf\nStorage Root\nand\nStorage Location\nvalues are empty when you get the results, an external location was not assigned to the catalog or the schema.\nSolution\nYou cannot modify the catalog or schema of an existing metastore. You must\ncreate a new Unity Catalog metastore\n(\nAWS\n|\nAzure\n) with a different storage location.\nAfter creating a new metastore, you have to re-register your model in the new location. This ensures that artifacts are uploaded to the specified catalog or schema, instead of the default root metastore storage. Storing artifacts in the default root metastore storage is not recommended.\nFor more details, review the\nModel serving with Databricks\n(\nAWS\n|\nAzure\n) documentation." 
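The `DESCRIBE ... EXTENDED` check described above can also be scripted. A minimal sketch in plain Python, assuming you have collected the describe output as (key, value) pairs; the row keys `Storage Root` and `Storage Location` are the ones the article tells you to inspect:

```python
# Hypothetical helper: given DESCRIBE CATALOG/SCHEMA EXTENDED output as
# (key, value) pairs, report whether an external location is configured.
# Empty "Storage Root"/"Storage Location" values mean no external location
# was assigned to the catalog or schema.
def has_external_location(describe_rows):
    info = dict(describe_rows)
    return bool(info.get("Storage Root") or info.get("Storage Location"))

# Example: a schema with no storage location assigned (illustrative values)
no_location_rows = [("Catalog Name", "main"),
                    ("Storage Root", ""),
                    ("Storage Location", "")]
```

In a notebook, the `describe_rows` input would come from collecting the result of the SQL commands shown above; here it is stubbed so the logic stands alone.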
+} \ No newline at end of file diff --git a/scraped_kb_articles/failure-when-mounting-storage.json b/scraped_kb_articles/failure-when-mounting-storage.json new file mode 100644 index 0000000000000000000000000000000000000000..8efc7218aa32b3205aa8ab892299d52589cd8758 --- /dev/null +++ b/scraped_kb_articles/failure-when-mounting-storage.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/failure-when-mounting-storage", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to access an existing mount point, or create a new mount point, and it fails with an error message.\nInvalid Mount Exception:The backend could not get tokens for path /mnt.\nCause\nThe root mount path (\n/mnt\n) is also mounted to a storage location.\nYou can verify that something is mounted to the root path by listing all mount points with DBUtils (\nAWS\n|\nAzure\n|\nGCP\n).\n%python\r\n\r\ndbutils.fs.mounts()\nIf\n/mnt\nis listed with a source, you have storage incorrectly mounted to the root path.\nSolution\nYou should unmount the root mount path.\n%python\r\n\r\ndbutils.fs.unmount(\"/mnt\")\nYou can now access existing mount points, or create new mount points."
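The check the article performs with `dbutils.fs.mounts()` can be expressed as a small guard. A sketch in plain Python, with the mount list passed in as (mountPoint, source) pairs standing in for what DBUtils returns; the sources shown are illustrative:

```python
# Hypothetical guard: return the source mounted directly at /mnt, if any.
# A non-None result means storage is incorrectly mounted at the root path
# and should be removed with dbutils.fs.unmount("/mnt") as shown above.
def root_mount_source(mounts):
    for mount_point, source in mounts:
        if mount_point == "/mnt":
            return source
    return None

mounts = [("/mnt", "wasbs://container@account.blob.core.windows.net/"),
          ("/mnt/data", "s3a://my-bucket/data")]
```

Only `/mnt` itself is flagged; sub-paths such as `/mnt/data` are legitimate mount points and pass through untouched.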
+} \ No newline at end of file diff --git a/scraped_kb_articles/fetching-the-last-access-datetime-of-tables-across-all-workspaces-runs-slowly-and-inefficiently.json b/scraped_kb_articles/fetching-the-last-access-datetime-of-tables-across-all-workspaces-runs-slowly-and-inefficiently.json new file mode 100644 index 0000000000000000000000000000000000000000..8138e04fd136c6c81dc9b77d4c21a134766fc416 --- /dev/null +++ b/scraped_kb_articles/fetching-the-last-access-datetime-of-tables-across-all-workspaces-runs-slowly-and-inefficiently.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metrics/fetching-the-last-access-datetime-of-tables-across-all-workspaces-runs-slowly-and-inefficiently", + "title": "Unknown Article Title", + "content": "Problem\nYou need to retrieve the last access date/time of a table across all workspaces, but the per-table approach using the\nSHOW TABLE EXTENDED\ncommand, or checking the insights tab, seems slow and inefficient.\nCause\nSHOW TABLE EXTENDED\nqueries each table one at a time, one workspace at a time, causing a surge of requests to hit your metastore.\nSolution\nDatabricks recommends using audit log system tables.\nFirst, check that you have the necessary permissions to access audit logs. You can grant permissions using the Databricks SQL query\nGRANT SELECT ON system.access.audit TO \n.\nRun the following SQL query to retrieve the last access date/time of a table.
Replace\n\nand\n\nwith your respective database and table names.\nSELECT\r\naction_name as `EVENT`,\r\nevent_time as `WHEN`,\r\nrequest_params, user_identity.email,\r\nIFNULL(request_params.full_name_arg, 'Non-specific') AS `TABLE ACCESSED`,\r\nIFNULL(request_params.commandText,'GET table') AS `QUERY TEXT`\r\nFROM system.access.audit\r\nWHERE request_params.full_name_arg = 'catalog_.'\r\nAND action_name IN ('createTable', 'commandSubmit','getTable','deleteTable')\r\norder by event_time DESC\nNote\nAudit logs maintain information for all workspaces in your account for the same cloud region. If the workspace is available in a different region, query separately from that region.\nFor more information, refer to the\nAudit log system table reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/field-name-sorting-changes-in-apache-spark-3x.json b/scraped_kb_articles/field-name-sorting-changes-in-apache-spark-3x.json new file mode 100644 index 0000000000000000000000000000000000000000..1d773f6c817fd4ad30f0e9ef850d83624f992768 --- /dev/null +++ b/scraped_kb_articles/field-name-sorting-changes-in-apache-spark-3x.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/field-name-sorting-changes-in-apache-spark-3x", + "title": "Unknown Article Title", + "content": "Problem\nWhen using a map transformation on an RDD using Databricks Runtime 9.1 LTS and above, the resulting schema order is different when compared to doing the same map transformation using Databricks Runtime 7.3 LTS.\nCause\nDatabricks Runtime 9.1 LTS and above incorporate Apache Spark 3.x. Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically.
Instead, they are ordered as entered.\nSolution\nTo enable Spark 2.x-style row sorting, set\nPYSPARK_ROW_FIELD_SORTING_ENABLED\nto\ntrue\nin your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nPYSPARK_ROW_FIELD_SORTING_ENABLED=true\nFor Python versions less than 3.6, the field names can only be sorted alphabetically.\nDelete\nWarning\nThis workaround is deprecated and will be removed in a future version of Spark." +} \ No newline at end of file diff --git a/scraped_kb_articles/fields_already_exists-error-in-sparksql-when-changing-column-name-capitalization.json b/scraped_kb_articles/fields_already_exists-error-in-sparksql-when-changing-column-name-capitalization.json new file mode 100644 index 0000000000000000000000000000000000000000..540c38cf5905ccc21a41206d0d02c3d6b658e9af --- /dev/null +++ b/scraped_kb_articles/fields_already_exists-error-in-sparksql-when-changing-column-name-capitalization.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/fields_already_exists-error-in-sparksql-when-changing-column-name-capitalization", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to change the case of a column name in a table using the\nALTER TABLE .. RENAME COLUMN\ncommand in Databricks when you encounter an error.\nThis issue typically occurs when attempting to rename a column to a name that already exists in the table, but with a different case. For example, renaming the column\nDescription\nto\ndescription\nin a table that already has a\ndescription\ncolumn.\nExample error\n[FIELDS_ALREADY_EXISTS] Cannot rename column, because `description` already exists...\nCause\nThe underlying cause of this issue is the case-insensitive nature of column names in Databricks.
Column names are stored in a case-insensitive manner, meaning that\nDescription\nand\ndescription\nare considered the same column name.\nSolution\nSet\nspark.sql.caseSensitive\nto\ntrue\nto change the default behaviour.\n%python\r\n\r\nspark.sql(\"set spark.sql.caseSensitive=true\");\nYou may need to\nenable column mapping\n(\nAWS\n|\nAzure\n|\nGCP\n)  on your Delta table with the mapping mode name.\nYou will need to verify the\nprotocol version\n(\nAWS\n|\nAzure\n|\nGCP\n) of your table. If it is using an old protocol, you will need to upgrade the protocol version.\nIf your table is already on or above the required protocol version:\n%python\r\n\r\nALTER TABLE table_name SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')\nIf your table is not on the required protocol version and requires a protocol upgrade:\n%python\r\n\r\nALTER TABLE table_name SET TBLPROPERTIES (\r\n  'delta.columnMapping.mode' = 'name',\r\n  'delta.minReaderVersion' = '2',\r\n  'delta.minWriterVersion' = '5')\nInfo\nIf you don’t want to permanently change the case sensitivity configuration, you can apply a workaround. For example, temporarily change the name to another one, then change it back with a case modification." 
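A quick way to see why the rename fails: with `spark.sql.caseSensitive` at its default of `false`, column names that differ only by case compare as equal. A minimal illustration of that comparison rule in plain Python, using the column names from the article's example (the helper name is illustrative):

```python
# Mimics the default case-insensitive comparison behind
# [FIELDS_ALREADY_EXISTS]: renaming "Description" to "description"
# collides because an equal (lowercased) name already exists.
def rename_collides(existing_columns, old_name, new_name):
    others = [c for c in existing_columns if c != old_name]
    return any(c.lower() == new_name.lower() for c in others)

cols = ["id", "Description", "description"]
```

With `spark.sql.caseSensitive=true`, the comparison would use the raw names instead of their lowercased forms, which is why the setting unblocks the rename.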
+} \ No newline at end of file diff --git a/scraped_kb_articles/file-corruption-error-on-apache-spark-streaming-jobs-during-file-processing-in-dbfs.json b/scraped_kb_articles/file-corruption-error-on-apache-spark-streaming-jobs-during-file-processing-in-dbfs.json new file mode 100644 index 0000000000000000000000000000000000000000..c907c9a0a78faeac0b7e90f138663764e5987333 --- /dev/null +++ b/scraped_kb_articles/file-corruption-error-on-apache-spark-streaming-jobs-during-file-processing-in-dbfs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/file-corruption-error-on-apache-spark-streaming-jobs-during-file-processing-in-dbfs", + "title": "Unknown Article Title", + "content": "Problem\nIn Databricks Apache Spark Streaming jobs, when processing files in Databricks File System (DBFS) you notice file corruption occurring with the following error.\nRuntimeException: dbfs:/mnt///part-00000-.c000.snappy.parquet is not a Parquet file. Expected magic number at tail, but found [65, 20, -64, -72]\nThis issue impacts the stability and data integrity of Spark Streaming jobs that read and write files in DBFS.\nCause\nFile corruption can be a result of using\ndbutils.fs\nwithin executors, which is not supported. For instance, consider the following scenario where you're performing a recursive delete operation on executors.\ndbutils.fs.ls(p).map(_.path).toDF.foreach { file =>\r\n  dbutils.fs.rm(file(0).toString, true)\r\n}\nIn this case,\ndbutils.fs\ncreates a DBFS instance using a default Hadoop configuration that lacks the setting\ndbfs.client.version = v2\n. Without this configuration, the improperly initialized DBFS v1 instance gets cached. When Spark writes files, it mistakenly uses this incorrect instance.
Since DBFS v1 can cause incomplete file writes if the DataDaemon crashes or restarts, this may lead to incomplete writes at the target location and result in file corruption.\nSolution\nReplace\ndbutils.fs\noperations with Hadoop filesystem methods to avoid caching issues. Databricks recommends using recursive deletion with Hadoop FS APIs for better scalability. Test in a staging environment before production.\nimport scala.util.{Try, Success, Failure}\r\nimport org.apache.hadoop.fs._\r\n\r\nval source = \"dbfs:/mnt///\"\r\nval conf = new org.apache.spark.util.SerializableConfiguration(sc.hadoopConfiguration)\r\nval broadcastConf = sc.broadcast(conf)\r\n\r\ndef delete(p: String): Unit = {\r\n dbutils.fs.ls(p).map(_.path).toDF.foreach { file =>\r\n    val conf = broadcastConf.value.value\r\n    val delPath = new Path(file(0).toString)\r\n    val toFs = delPath.getFileSystem(conf)\r\n    toFs.delete(delPath, true)\r\n }\r\n}\r\n\r\ndef walkDelete(root: String)(level: Int): Unit = {\r\n dbutils.fs.ls(root).map(_.path).foreach { p =>\r\n    println(s\"Deleting: $p, on level: $level\")\r\n    val deleting = Try {\r\n      if (level == 0) delete(p)\r\n      else if (p endsWith \"/\") walkDelete(p)(level - 1)\r\n      else delete(p)\r\n    }\r\n    deleting match {\r\n      case Success(_) => println(s\"Successfully deleted $p\")\r\n      case Failure(e) => if (!e.getMessage.contains(\"specified path does not exist\")) throw e\r\n    }\r\n }\r\n}\r\nval level = 1 //set depending on your partition levels\r\nwalkDelete(source)(level)\nAlternatively, you can use parallelized deletion. 
Parallelized deletion only parallelizes at the top level, which may be slower with skewed partitions.\nimport scala.util.{Try, Success, Failure}\r\nimport org.apache.hadoop.fs._\r\n\r\nval source = \"dbfs:/mnt///\"\r\nval conf = new org.apache.spark.util.SerializableConfiguration(sc.hadoopConfiguration)\r\nval broadcastConf = sc.broadcast(conf)\r\nval filesToDel = dbutils.fs.ls(source).map(_.path)\r\n\r\nspark.sparkContext.parallelize(filesToDel).foreachPartition { rows =>\r\n rows.foreach { file =>\r\n    val conf = broadcastConf.value.value\r\n    val delPath = new Path(file)\r\n    val toFs = delPath.getFileSystem(conf)\r\n    toFs.delete(delPath, true)\r\n }\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/file-sink-streaming.json b/scraped_kb_articles/file-sink-streaming.json new file mode 100644 index 0000000000000000000000000000000000000000..6e977a6f6f8d69f16cb6ac7469683a4ba73a78cf --- /dev/null +++ b/scraped_kb_articles/file-sink-streaming.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/file-sink-streaming", + "title": "Unknown Article Title", + "content": "When you stream data into a file sink, you should always change both checkpoint and output directories together. Otherwise, you can get failures or unexpected outputs.\nApache Spark creates a folder inside the output directory named\n_spark_metadata\n. This folder contains write-ahead logs for every batch run. This is how Spark gets exactly-once guarantees when writing to a file system. This folder contains save files for each batch (named 0, 1, 2, 3, etc., plus compact files such as 19.compact, n.compact). These files include JSON that gives details about the output for the particular batch. With the help of this data, once a batch has succeeded, any duplicate batch output is discarded.\nIf you change the checkpoint directory but not the output directory:\nWhen you change the checkpoint directory, the stream job will start batches again from 0.
Since 0 is already present in the\n_spark_metadata\nfolder, the output file will be discarded even if it has new data. That is, if you stop the previous run on the 500th batch, the next run with the same output directory and a different checkpoint directory will give output only on the 501st batch. All of the previous batches will be silently discarded.\nIf you change the output directory but not the checkpoint directory:\nWhen you change only the output directory, it loses all of the batch data from the\n_spark_metadata\nfolder. But Spark starts writing from the next batch according to the checkpoint directory. For example, if the previous run was stopped at 500, the first write of the new stream job will be at file 501 on\n_spark_metadata\nand you lose all of the old batches. When you read the files back, you get the error\nmetadata for batch 0(or first compact file (19.compact)) is not found\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/filenotfounderror-when-trying-to-use-android-development-bridge-adb-command-line-tool-on-a-cluster.json b/scraped_kb_articles/filenotfounderror-when-trying-to-use-android-development-bridge-adb-command-line-tool-on-a-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..02f25595b2285bed5586acb99eab81a735e6da31 --- /dev/null +++ b/scraped_kb_articles/filenotfounderror-when-trying-to-use-android-development-bridge-adb-command-line-tool-on-a-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/filenotfounderror-when-trying-to-use-android-development-bridge-adb-command-line-tool-on-a-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to use the Android Debug Bridge command line tool (\nadb\ntool) to list or connect to Android devices from a cluster, you receive an error.\nError - “FileNotFoundError: [Errno 2] No such file or directory: 'adb'”\nYou notice the\nadb\ntool works in your local
system.\nCause\nThe\nadb\ntool is not installed in the correct system path, causing commands to go unrecognized on the cluster.\nThe\nadb\ntool comes pre-packaged with Android Studio and the Android Platform Tools Bundle, which is why it works in your local system.\nSolution\nUse the following example init script to install the\nadb\ntool in the cluster and set it in the correct system path. Because the Databricks Runtime images are Ubuntu-based, the init script uses the Linux download for the\nadb\ntool.\n#!/bin/bash\r\n# Download the latest platform-tools for Linux\r\nwget -qO /tmp/platform-tools.zip https://dl.google.com/android/repository/platform-tools-latest-linux.zip\r\n# Unzip the downloaded file to a temporary directory\r\nunzip -q /tmp/platform-tools.zip -d /tmp/\r\nsudo cp /tmp/platform-tools/adb /usr/local/bin/adb\r\n# Clean up temporary files and directories\r\nrm -rf /tmp/platform-tools.zip /tmp/platform-tools\r\n# Optional: Verify the installation by printing adb version\r\nadb version\nFor more information, refer to the Android\nSDK Platform Tools release notes\ndocumentation.
+} \ No newline at end of file diff --git a/scraped_kb_articles/filereadexception-error-when-trying-to-run-streaming-job-reading-from-system-tables.json b/scraped_kb_articles/filereadexception-error-when-trying-to-run-streaming-job-reading-from-system-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..53f66afe0ed44d131cde5ab0211bb8747a8960e7 --- /dev/null +++ b/scraped_kb_articles/filereadexception-error-when-trying-to-run-streaming-job-reading-from-system-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/filereadexception-error-when-trying-to-run-streaming-job-reading-from-system-tables", + "title": "Unknown Article Title", + "content": "Problem\nWhen running a streaming job that reads from system tables using Delta Sharing, you encounter a\nFileReadException\nerror.\nExample error message\ncom.databricks.sql.io.FileReadException: Error while reading file delta-sharing:/XXXXXXXXXXXXXXXX.uc-deltasharing%253A%252F%252Fsystem.access.audit%2wer3system.access.audit_XXXXXXXXXXXXXX. Caused by: org.apache.spark.SparkIOException: [HDFS_HTTP_ERROR.KEY_NOT_EXIST] When attempting to read from HDFS, HTTP request failed. Status 404 Not Found. Could not find key: HTTP request failed with status: HTTP/1.1 404 Not Found \r\nNoSuchKey The specified key does not exist.\nCause\nYour scheduled job runs don’t occur frequently enough to keep up with the current source table version. You’re working in an outdated source table version and your streaming job is trying to read data files that no longer exist on the source system table.\nAdditionally, when a job runs, if\nmaxVersionsPerRpc\nis set at the default (\n100\n), the streaming query only processes 100 versions of the source table per remote procedure call (RPC) request.
If the current source table version is over 100 versions away, a single job run can’t catch up your table version to the current version.\nSolution\nIncrease the job frequency so the source table version you’re working with doesn't fall behind the current version.\nAlternatively, keep your job frequency at one batch per day but increase the\nmaxVersionsPerRpc\nto\n500\nor an upper limit that allows for processing a day’s worth of data.\nImportant\nIncreasing\nmaxVersionsPerRpc\nmeans more files to be processed, increasing the likelihood of hitting Delta Sharing server limits (either 1000 files or five minutes).\nFor more information on streaming, review the\nRead Delta Sharing Tables\ndocumentation.\nFor more information on system tables, review the\nMonitor account activity with system tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/filereadexception-on-dbfs-mounted-filesystem.json b/scraped_kb_articles/filereadexception-on-dbfs-mounted-filesystem.json new file mode 100644 index 0000000000000000000000000000000000000000..565d4c9c1b9687d9115c75f101341625f070e713 --- /dev/null +++ b/scraped_kb_articles/filereadexception-on-dbfs-mounted-filesystem.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/filereadexception-on-dbfs-mounted-filesystem", + "title": "Unknown Article Title", + "content": "Problem\nYour Apache Spark jobs are failing with a\nFileReadException\nerror when attempting to read files on DBFS (Databricks File System) mounted paths.\norg.apache.spark.SparkException: Job aborted due to stage failure: Task x in stage y failed n times, most recent failure: Lost task 0.3 in stage 141.0 (TID 770) (x.y.z.z executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/Cloudfolder/folder1/silver_table/part-00000-twerrx-abcd-4538-ae46-87041a4fxxxx-c000.snappy.parquet\nCause\nA\nFileReadException\nerror can occur when a job dynamically handles
mounts. When a series of mounts and unmounts happens for the same path in the same workspace, the driver and the executors can end up with different paths at the same mount point, depending on when the cluster was started and/or the mount was initialized.\nSolution\nTo prevent the error from occurring, you need to include the\ndbutils.fs.refreshMounts()\ncommand in your Spark job before you reference a DBFS path.\nThe\ndbutils.fs.refreshMounts()\ncommand refreshes the mount points in the current workspace. This ensures that the executors and driver have a current and consistent view of the mount, regardless of when the cluster was started and/or the mount was initialized.\nFor more information, please review the\ndbutils.fs.refreshMounts()\ndocumentation\n(\nAWS\n|\nAzure\n|\nGCP\n).\nDelete\nInfo\nIf possible, you should refer to the direct path of the storage URI when running streaming jobs. If you have to use mounts, try to avoid repeated mounting and unmounting." +} \ No newline at end of file diff --git a/scraped_kb_articles/filereadexception-when-reading-delta-table.json b/scraped_kb_articles/filereadexception-when-reading-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..8504ac8b223268f5b46b410ed9ef3ec4a09b9e43 --- /dev/null +++ b/scraped_kb_articles/filereadexception-when-reading-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/filereadexception-when-reading-delta-table", + "title": "Unknown Article Title", + "content": "Problem\nYou attempt to read a Delta table from mounted storage and get a\nFileReadException\nerror.\nFileReadException: Error while reading file abfss:REDACTED@REDACTED.dfs.core.windows.net/REDACTED/REDACTED/REDACTED/REDACTED/PARTITION=REDACTED/part-00042-0725ec45-5c32-412a-ab27-5bc88c058773.c000.snappy.parquet. A file referenced in the transaction log cannot be found.
This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions\r\n Caused by: FileNotFoundException: Operation failed: 'The specified path does not exist.', 404, HEAD, https:// REDACTED.dfs.core.windows.net/ REDACTED/ REDACTED/REDACTED/REDACTED/PARTITION=REDACTED/part-00042-0725ec45-5c32-412a-ab27-5bc88c058773.c000.snappy.parquet?upn=false&action=getStatus&timeout=90\nCause\nFileReadException\nerrors occur when the underlying data does not exist. The most common cause is manual deletion.\nIf the underlying data was not manually deleted, the mount point for the storage blob was removed and recreated while the cluster was writing to the Delta table.\nDelta Lake does not fail a table write if the location is removed while the data write is ongoing. Instead, a new folder is created in the default storage account of the workspace, with the same path as the removed mount. Data continues to be written in that location.\nIf the mount is recreated before the write operation is finished, and the Delta transaction logs are made available again, Delta updates the transaction logs and the write is considered successful. When this happens, data files written to the default storage account while the mount was deleted are not accessible, as the path currently references the mounted storage account location.\nDelete\nInfo\nYou can use diagnostic logging to verify that a mount was removed. Query the DBFS table for\nmount\nand\nunmount\nevents.\nFor example:\nDatabricksDBFS\r\n| where ActionName == \"unmount\" or ActionName == \"mount\"\nSolution\nYou can restore the missing data in one of two ways.\nRepair the Delta table and add the missing data back with a custom job.\nUse\nFSCK\nto repair the table.\n%sql\r\n\r\nFSCK REPAIR TABLE \nRewrite the missing data with a custom job. 
This option is a good choice if you can re-run the last job without risking duplicate data.\nManually recover the missing files.\nVerify that there are no active jobs reading or writing to the mounted storage account that contains the Delta table.\nUnmount the mount path. This allows you to access the\n/mnt/\ndirectory in the default storage account.\n%python\r\n\r\ndbutils.fs.unmount(\"/mnt/\")\nUse\ndbutils.fs.mv\nto move the files located in the table path to a temporary location.\n%python\r\n\r\ndbutils.fs.mv(\"/mnt/\", \"/tmp/tempLocation/\", True)\nRecreate the mount point.\n%python\r\n\r\ndbutils.fs.mount(source = \"abfss://@.dfs.core.windows.net/\", mount_point = \"/mnt/\", extra_configs = configs)\nReview the\nAccess Azure Data Lake Storage Gen2 and blob storage\ndocumentation for more information.\nMove the files from the temporary location to the updated Delta table path.\n%python\r\n\r\ndbutils.fs.mv(\"/tmp/tempLocation\", \"/mnt/\", True)\nIf any jobs are reading or writing to the mount point when you attempt a manual recovery, you may cause the issue to reoccur. Verify that the mount is not in use before attempting a manual repair.\nBest practices\nInstruct users to get approval before unmounting a storage location.\nIf you must unmount a storage location, verify there are no jobs running on the cluster.\nUse\ndbutils.fs.updateMount\nto update information about the mount. Do not use\nunmount\nand\nmount\nto update the mount.\nUse diagnostic logging to identify any possible\nunmount\nissues.\nRun production jobs only on job clusters, which are not affected by temporary unmount commands while running, unless they run the\ndbutils.fs.refreshMounts\ncommand.\nWhen running jobs on interactive clusters, add a verification step at the end of a job (such as a count) to check for missing data files. If any are missing, an error is triggered immediately.
+} \ No newline at end of file diff --git a/scraped_kb_articles/files-restored-from-a-delta-table-archive-are-not-recognized-by-delta-with-archival-support-enabled.json b/scraped_kb_articles/files-restored-from-a-delta-table-archive-are-not-recognized-by-delta-with-archival-support-enabled.json new file mode 100644 index 0000000000000000000000000000000000000000..19cc88f6a244571fe4be2b15ca99159e557abc95 --- /dev/null +++ b/scraped_kb_articles/files-restored-from-a-delta-table-archive-are-not-recognized-by-delta-with-archival-support-enabled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/files-restored-from-a-delta-table-archive-are-not-recognized-by-delta-with-archival-support-enabled", + "title": "Unknown Article Title", + "content": "Problem\nYou are using Databricks Delta\narchival support\n(\nAWS\n|\nAzure\n|\nGCP\n). You restored files from the archive storage, but when you run a query, the files are not recognized by Delta. You get a “delta archived files in limit” error message.\n[DELTA_ARCHIVED_FILES_IN_LIMIT] Table delta.`` does not contain enough records in non-archived files to satisfy the specified LIMIT of 10001 records.\nThis happens even after you have restored all files identified using the\nSHOW ARCHIVED FILES\ncommand.\nCause\nThe error\nDELTA_ARCHIVED_FILES_IN_LIMIT\ntypically occurs when a query attempts to access data files older than the configured archival threshold in\ndelta.timeUntilArchived\n. This is expected behavior.\nHowever, even after correctly identifying and restoring all the necessary files using the\nSHOW ARCHIVED FILES\ncommand, Delta still fails to recognize these restored files.
This is a known limitation in the current version of the archival feature, where restored files from the archive are not properly detected by Delta.\nSolution\nYou can work around this by temporarily disabling archival support and then accessing the restored files.\nALTER TABLE UNSET TBLPROPERTIES ('delta.timeUntilArchived');\nAlternatively, you can increase the value for\ndelta.timeUntilArchived\nso the time period includes all of the data files that were restored. For example, if the current value is 10 days, but the files you restored are 30 days old, you can update\ndelta.timeUntilArchived\nto 31 days. After the time period has been updated, Delta should recognize the restored files." +} \ No newline at end of file diff --git a/scraped_kb_articles/filter-condition-in-the-for-each-task-type-not-filtering-correctly.json b/scraped_kb_articles/filter-condition-in-the-for-each-task-type-not-filtering-correctly.json new file mode 100644 index 0000000000000000000000000000000000000000..b566d2375bef8097196e9c554289ab8367df1b62 --- /dev/null +++ b/scraped_kb_articles/filter-condition-in-the-for-each-task-type-not-filtering-correctly.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/filter-condition-in-the-for-each-task-type-not-filtering-correctly", + "title": "Unknown Article Title", + "content": "Problem\nWhen you pass\nparam = \"${param}\"\nwhere\n${param}\nis meant to dynamically reference values from job parameters, you notice the filter condition in the\nfor each\ntask type does not filter records as expected.\nCause\nThe\n${param}\nsyntax doesn’t correctly parse or apply the filter condition within the\nfor each\ntask type, resulting in an empty result set. This is a known limitation.\nSolution\nIn Databricks Runtime 15.2 or above, or in a SQL warehouse, use the\n:Param\nsyntax in your filter condition instead. For example,\nparam = :Param\n.\nExample\nThe following is a modified query using the\n:Param\nsyntax.
The accompanying screenshot demonstrates the modified query with\nplace = :Place\n.\nselect * from where = :\nFor more information, review the\nWork with query parameters\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/filtering-data-with-char255-datatype-column-does-not-retrieve-a-result.json b/scraped_kb_articles/filtering-data-with-char255-datatype-column-does-not-retrieve-a-result.json new file mode 100644 index 0000000000000000000000000000000000000000..8e82f045dfd07206857dd8404b022f2e4e17196d --- /dev/null +++ b/scraped_kb_articles/filtering-data-with-char255-datatype-column-does-not-retrieve-a-result.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/filtering-data-with-char255-datatype-column-does-not-retrieve-a-result", + "title": "Unknown Article Title", + "content": "Problem\nWorking in Databricks Runtime 14.3 LTS or below, you create a Delta table with a column that sets the\nchar(255)\ndata type in a notebook.\nSELECT *\nworks to retrieve the records, but applying a filter on the column with the\nchar(255)\ndata type does not retrieve any records.\nExample query\nSelect * from ..\r\nWhere  = “value”\nCause\nIn Databricks Runtime 14.3 LTS and below, the runtime compares the\nchar(255)\ncolumn directly with the string value. Because the\nchar\ntype is padded to a length of 255 characters, the padded value does not match the unpadded string, so no rows are returned.\nSolution\nThere are three choices available.\nUse\ntrim\nto remove the blank space from the right side of the string value.\nWhere trim() = “value”\nCast the column name as a string type.\nWhere cast( as string) = “value”\nAlternatively, upgrade to Databricks Runtime 15.4 LTS or above. These versions of Databricks Runtime handle the\nchar(255)\ntype using the rpad function to right-pad the\ncolumn_name ‘name’\nwith spaces, to a total length of 255 characters.
The issue therefore does not occur in Databricks Runtime 15.4 LTS or above." +} \ No newline at end of file diff --git a/scraped_kb_articles/find-long-running-queries-in-your-sql-warehouse.json b/scraped_kb_articles/find-long-running-queries-in-your-sql-warehouse.json new file mode 100644 index 0000000000000000000000000000000000000000..0594a127e9ebf13b8f71203c7129f3f8a1376426 --- /dev/null +++ b/scraped_kb_articles/find-long-running-queries-in-your-sql-warehouse.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/find-long-running-queries-in-your-sql-warehouse", + "title": "Unknown Article Title", + "content": "Problem\nYour Databricks SQL warehouse continues running even when no new queries are submitted. You may have a queue of low-cost queries built up, impacting performance. You want to identify any long-running queries to prevent resource wastage and optimize query execution.\nCause\nLong-running queries keep a warehouse in use for an extended period of time.\nA query may run for an extended time due to:\nComplex joins, large datasets, or inefficient execution plans.\nHigh workload demand on the warehouse.\nInsufficient compute resources, leading to slower processing.\nQueries stuck in the queue, delaying execution.\nSolution\nThere are multiple ways to identify long-running queries.\nQuery history UI\nIn the workspace, on the left side panel under\nSQL\nclick\nQuery History\n.\nYou can filter queries by calendar time,\nCompute\n(by warehouse),\nDuration\n, and\nStatus\n(running, finished, etc.).\nInfo\nThe\nPeak query count\nchart, which appears in the\nMonitoring\ntab of a selected SQL warehouse, is often misinterpreted as showing all running queries. However, it only highlights queries with a non-trivial load, affecting autoscaling behavior.
For a complete view of all running queries, use the Query History UI, the Databricks API, or system tables.\nQuery history API\nUse the\nList Queries\n(\nAWS\n|\nAzure\n|\nGCP\n) API to fetch long-running queries dynamically.\nThe example script helps identify queries that have exceeded 2 minutes of total execution time. It filters queries based on the specified warehouse and time period while also providing the total count of queries that surpass the duration threshold.\nExample code\n%python\r\n\r\nimport requests\r\nimport json\r\nimport time\r\n\r\n# Databricks Configuration\r\nDATABRICKS_INSTANCE = \"https://\"\r\nAPI_TOKEN = \"\"\r\nAPI_URL = f\"{DATABRICKS_INSTANCE}/api/2.0/sql/history/queries\"\r\n\r\n# Convert date/time string to milliseconds\r\ndef to_milliseconds(date_string, format=\"%Y-%m-%d %H:%M:%S\"):\r\n    return int(time.mktime(time.strptime(date_string, format)) * 1000)\r\n\r\n# Define the time range (modify as needed)\r\nstart_time_str = \"\"\r\nend_time_str = \"\"\r\n\r\n# Convert to milliseconds\r\nstart_time_ms = to_milliseconds(start_time_str)\r\nend_time_ms = to_milliseconds(end_time_str)\r\n\r\n# API Request Body\r\npayload = {\r\n    \"filter_by\": {\r\n        \"statuses\": [\"FINISHED\", \"RUNNING\"],  # Example: Filter by query statuses\r\n        \"warehouse_ids\": [\"\"],  # List of warehouse IDs\r\n        \"query_start_time_range\": {\r\n            \"start_time_ms\": start_time_ms,\r\n            \"end_time_ms\": end_time_ms\r\n        }\r\n    }\r\n}\r\n\r\n# Headers\r\nheaders = {\r\n    \"Authorization\": f\"Bearer {API_TOKEN}\",\r\n    \"Content-Type\": \"application/json\"\r\n}\r\n\r\n# Make the API request\r\nresponse = requests.get(API_URL, headers=headers, json=payload)\r\n\r\n# Handle response\r\nif response.status_code == 200:\r\n    data = response.json()\r\n    queries = data.get(\"res\", [])\r\n\r\n    # Filter queries where duration > 120000 ms (2 minutes)\r\n    long_queries = [q for q in queries if 
q.get(\"duration\", 0) > 120000]\r\n\r\n    # Print the filtered queries\r\n    print(\"\\nQueries with duration > 2 minutes:\\n\")\r\n    for query in long_queries:\r\n        print(json.dumps(query, indent=4))\r\n\r\n    print(f\"\\nTotal Queries with duration > 2 minutes: {len(long_queries)}\")\r\n\r\nelse:\r\n    print(f\"Error: {response.status_code} - {response.text}\")\nInfo\nFor proactive monitoring, schedule a job to call the query history API regularly and set up alerts based on the output. This can help you detect and act on long-running queries before they impact performance or cost.\nSystem table query history\nYou can retrieve information on long-running queries from the\nQuery history system table\n(\nAWS\n|\nAzure\n|\nGCP\n). Queries are observable when they reach a terminal state (\nFAILED\n,\nCANCELLED\n, or\nFINISHED\n).\nThe example SQL query retrieves long-running queries that have already finished or failed. It filters queries based on execution duration, warehouse, and time period.\nExample code\n%sql\r\n\r\nSELECT \r\n    statement_id, \r\n    statement_text, \r\n    executed_by, \r\n    execution_status, \r\n    start_time, \r\n    end_time, \r\n    total_duration_ms,\r\n    execution_duration_ms,\r\n    compilation_duration_ms,\r\n    result_fetch_duration_ms,\r\n    compute.warehouse_id\r\nFROM system.query_history \r\nWHERE \r\n    total_duration_ms > 2 * 60 * 1000 -- Queries running longer than 2 minutes\r\n    AND start_time >= TIMESTAMPADD(DAY, -8, CURRENT_TIMESTAMP) -- Queries from the last 8 days\r\n    AND (compute.warehouse_id = '') -- Filter by warehouse ID (modify as needed)\r\n    AND execution_status IN ('RUNNING', 'FINISHED') -- Running or completed queries\r\nORDER BY total_duration_ms DESC; -- Longest-running queries first" +} \ No newline at end of file diff --git a/scraped_kb_articles/find-size-of-table-snapshot.json b/scraped_kb_articles/find-size-of-table-snapshot.json new file mode 100644 index 
0000000000000000000000000000000000000000..8175f5320046a71ccada1051ec9112ada4c41b50 --- /dev/null +++ b/scraped_kb_articles/find-size-of-table-snapshot.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/find-size-of-table-snapshot", + "title": "Unknown Article Title", + "content": "This article explains how to find the size of a table snapshot.\nThe command used depends on whether you are trying to find the size of a Delta table or a non-Delta table.\nSize of a Delta table snapshot\nTo find the size of a Delta table snapshot, you can use an Apache Spark SQL command.\n%scala\r\n\r\nimport com.databricks.sql.transaction.tahoe._\r\nval deltaLog = DeltaLog.forTable(spark, \"dbfs:/\")\r\nval snapshot = deltaLog.snapshot               // the current delta table snapshot\r\nprintln(s\"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}\")\nSize of a non-Delta table snapshot\nYou can determine the size of a non-Delta table snapshot by calculating the total sum of the individual files within the underlying directory.\nYou can also use\nqueryExecution.analyzed.stats\nto return the size.\n%scala\r\n\r\nspark.read.table(\"\").queryExecution.analyzed.stats" +} \ No newline at end of file diff --git a/scraped_kb_articles/find-the-number-of-files-per-partition-in-a-delta-table.json b/scraped_kb_articles/find-the-number-of-files-per-partition-in-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..f23828db8871dd05225585762b969bba28b5c9aa --- /dev/null +++ b/scraped_kb_articles/find-the-number-of-files-per-partition-in-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/find-the-number-of-files-per-partition-in-a-delta-table", + "title": "Unknown Article Title", + "content": "Problem\nYou may need to find the number of files per partition in a Delta table to monitor partition sizes, optimize partitions, combine small files, and so on.\nCause\nIn Delta Lake, knowing the number of
files per partition in a Delta table is useful for cost management reasons.\nStorage Costs: Excessive small files increase storage costs because of additional metadata and inefficient storage utilization.\nCompute Costs: Too many files can increase compute costs as Spark needs to manage and read more files, leading to higher job execution times and cluster costs.\nSolution\nYou can find the number of files per partition with the provided example code. Replace\n\nwith the full path to your Delta table before running the example code in a notebook.\nPython example code\nThis snippet retrieves the partition structure of a Delta table and displays the number of data files per partition value to help assess file distribution and potential optimization needs.\n%python\r\n\r\n#Fetch the schema of the partitions\r\nschema = spark.sql(\"show partitions delta.`dbfs:/`\").schema\r\n#Extract the column names\r\ncolumn_names = [field.name for field in schema.fields]\r\n# Run the SQL query to count distinct file paths grouped by the first column\r\ndisplay(spark.sql(f\"select {column_names[0]}, count(distinct _metadata.file_path) from delta.`dbfs:/` group by {column_names[0]}\" ))\nExample results\n+---+-----------------------------------+\n| id|count(DISTINCT _metadata.file_path)|\n+---+-----------------------------------+\n|  5|                                  1|\n|  1|                                  1|\n|  3|                                  1|\n|  2|                                  1|\n|  4|                                  1|\n+---+-----------------------------------+\nScala example code\nThis snippet retrieves the partition structure of a Delta table and displays the number of data files per partition value to help assess file distribution and potential optimization needs.\n%scala\r\n\r\n// Fetch the schema of the partitions\r\nval schema = spark.sql(\"SHOW PARTITIONS delta.`dbfs:/`\").schema\r\n\r\n// Extract the column names\r\nval columnNames = 
schema.fields.map(_.name)\r\n\r\n// Run the SQL query to count distinct file paths grouped by the first column\r\nval query = s\"SELECT ${columnNames(0)}, COUNT(DISTINCT _metadata.file_path) FROM delta.`dbfs:/` GROUP BY ${columnNames(0)}\"\r\nval result = spark.sql(query)\r\n\r\n// Display the result\r\nresult.show()\nExample results\n+---+-----------------------------------+\n| id|count(DISTINCT _metadata.file_path)|\n+---+-----------------------------------+\n|  5|                                  1|\n|  1|                                  1|\n|  3|                                  1|\n|  2|                                  1|\n|  4|                                  1|\n+---+-----------------------------------+" +} \ No newline at end of file diff --git a/scraped_kb_articles/find-your-metastore-id.json b/scraped_kb_articles/find-your-metastore-id.json new file mode 100644 index 0000000000000000000000000000000000000000..b163552943d10c1cd210ea7809ac2d62e4cbf9da --- /dev/null +++ b/scraped_kb_articles/find-your-metastore-id.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/find-your-metastore-id", + "title": "Unknown Article Title", + "content": "Problem\nHow can I find my metastore ID?\nCause\nThe metastore ID may be required for various reasons.
For example, for quota increases or to attach it to a new workspace via automation tools such as Terraform.\nSolution\nYou can find your metastore ID at the account level or the workspace level.\nThe metastore ID has the following format: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.\nAccount console level\nInfo\nYou must be a Databricks account administrator to view information in the account console.\nClick\nManage account\nin the workspace selector to access your\naccount console\n(\nAWS\n|\nAzure\n|\nGCP\n).\nClick\nCatalog\nin the sidebar.\nIn the\nName\ncolumn, click the metastore you want to view.\nFrom here you can refer to the ID in different ways:\nUnder the\nbucket path\nsection, the metastore ID is the last part of the path reference.\n:///XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX\nIn the browser URL, the metastore ID is after data/ and before /configurations.\nhttps://accounts.cloud.databricks.com/data/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/configurations?account_id=\nWorkspace level\nYou can view the metastore ID for a metastore that is already attached to the workspace (\nAWS\n|\nAzure\n|\nGCP\n) by using Catalog Explorer or system tables.\nUsing Catalog Explorer\nLog in to your workspace.\nClick\nCatalog\n(\nAWS\n|\nAzure\n|\nGCP\n) in the sidebar.\nWithin the Catalog Explorer, select any catalog other than the\nhive_metastore\none.\nClick\nDetails\n.\nYour metastore ID is displayed in the\nMetastore Id\nfield.\nUsing system tables\nTo use this approach, you must have\nSELECT\npermissions against the\ncurrent metastore\n(\nAWS\n|\nAzure\n|\nGCP\n) and the\nsystem tables\n(\nAWS\n|\nAzure\n|\nGCP\n).\nLog in to your workspace.\nClick\nSQL Editor\n(\nAWS\n|\nAzure\n|\nGCP\n) in the sidebar to open the SQL editor or create a\nnotebook\n(\nAWS\n|\nAzure\n|\nGCP\n) with SQL code.\nAttach a\nUnity Catalog enabled compute resource\n(\nAWS\n|\nAzure\n|\nGCP\n).\nRun this query:\nSELECT metastore_id\r\nFROM system.information_schema.metastores;\nThe query output displays
the metastore ID attached to the workspace." +} \ No newline at end of file diff --git a/scraped_kb_articles/find-your-workspace-id.json b/scraped_kb_articles/find-your-workspace-id.json new file mode 100644 index 0000000000000000000000000000000000000000..ac174aab57ef9390df08ca2a4e5e2ba523bd839d --- /dev/null +++ b/scraped_kb_articles/find-your-workspace-id.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/find-your-workspace-id", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/fit-spark-model-error.json b/scraped_kb_articles/fit-spark-model-error.json new file mode 100644 index 0000000000000000000000000000000000000000..cab6ec4c8c337fbf0fc42a34694a548512c5e667 --- /dev/null +++ b/scraped_kb_articles/fit-spark-model-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/fit-spark-model-error", + "title": "Unknown Article Title", + "content": "Problem\nDatabricks throws an error when fitting a SparkML model or Pipeline:\norg.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)\nCause\nOften, an error when fitting a SparkML model or Pipeline is a result of issues with the training data.\nSolution\nCheck for the following issues:\nIdentify and address NULL values in a dataset. Spark needs to know how to address missing values in the dataset.\nDiscard rows with missing values with dropna().\nImpute some value like zero or the average value of the column. This solution depends on what is meaningful for the dataset.\nEnsure that all training data is appropriately transformed to a numeric format. Spark needs to know how to handle categorical and string variables.
A variety of\nfeature transformers\nare available to address data-specific cases.\nCheck for\ncollinearity\n. Highly correlated or even duplicate features may cause issues with model fitting. This occurs on rare occasions, but you should make sure to rule it out." +} \ No newline at end of file diff --git a/scraped_kb_articles/fixture-not-found-error-when-using-pytest-on-a-cluster-.json b/scraped_kb_articles/fixture-not-found-error-when-using-pytest-on-a-cluster-.json new file mode 100644 index 0000000000000000000000000000000000000000..68b1fe4585c4d73058ef620a574c158be3859f0f --- /dev/null +++ b/scraped_kb_articles/fixture-not-found-error-when-using-pytest-on-a-cluster-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/fixture-not-found-error-when-using-pytest-on-a-cluster-", + "title": "Unknown Article Title", + "content": "Problem\nWhen you spin up your cluster, Databricks Runtime updates pytest to version 8.3.3. You notice fixtures do not function as expected, and when you attempt to execute a test, you receive an error message.
Fixtures worked correctly in pytest version 8.3.2.\nExample of test execution\nimport ipytest \r\nimport pytest\r\n\r\ndef add(a, b):\r\n  return a+b\r\n\r\n@pytest.fixture\r\ndef input_data():\r\n  return (2, 3)\r\n\r\ndef test_add (input_data):\r\n  a, b = input_data\r\n  result = add(a, b)\r\n  assert result == 5\r\n\r\nipytest.run('-vv')\nExample of returned error\n____________________________________ ERROR at setup of test_add ____________________________________\r\nfile /home/spark-3e47400d-17a6-444e-ac05-ed/.ipykernel/56909/command-4400526906963776-1085653689, line 12\r\n  def test_add (input_data):\r\nE fixture 'input_data' not found\r\n> available fixtures: anyio_backend, anyio_backend_name, anyio_backend_options, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory\nCause\nDatabricks uses specialized IPython functionality that conflicts with pytest's execution environment in version 8.3.3.\nSolution\nUpdate Databricks Runtime to 16.1 or above for a built-in fix for this issue.\nIf you can’t or don’t want to update your Databricks Runtime, you can safely downgrade pytest to version 8.3.2 to continue using the library in your Databricks environment.\nNote\nAs with any third party library, downgrading from the most current version of the library may mean you miss library patches or important updates. Read the library release notes as needed." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/flatten-nested-columns-dynamically.json b/scraped_kb_articles/flatten-nested-columns-dynamically.json new file mode 100644 index 0000000000000000000000000000000000000000..e76800f1dd8c651fb8c3d2bd4de68444c6f30c7a --- /dev/null +++ b/scraped_kb_articles/flatten-nested-columns-dynamically.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/flatten-nested-columns-dynamically", + "title": "Unknown Article Title", + "content": "This article shows you how to flatten nested JSON, using only\n$\"column.*\"\nand\nexplode\nmethods.\nSample JSON file\nPass the sample JSON string to the reader.\n%scala\r\n\r\nval json =\"\"\"\r\n{\r\n        \"id\": \"0001\",\r\n        \"type\": \"donut\",\r\n        \"name\": \"Cake\",\r\n        \"ppu\": 0.55,\r\n        \"batters\":\r\n                {\r\n                        \"batter\":\r\n                                [\r\n                                        { \"id\": \"1001\", \"type\": \"Regular\" },\r\n                                        { \"id\": \"1002\", \"type\": \"Chocolate\" },\r\n                                        { \"id\": \"1003\", \"type\": \"Blueberry\" },\r\n                                        { \"id\": \"1004\", \"type\": \"Devil's Food\" }\r\n                                ]\r\n                },\r\n        \"topping\":\r\n                [\r\n                        { \"id\": \"5001\", \"type\": \"None\" },\r\n                        { \"id\": \"5002\", \"type\": \"Glazed\" },\r\n                        { \"id\": \"5005\", \"type\": \"Sugar\" },\r\n                        { \"id\": \"5007\", \"type\": \"Powdered Sugar\" },\r\n                        { \"id\": \"5006\", \"type\": \"Chocolate with Sprinkles\" },\r\n                        { \"id\": \"5003\", \"type\": \"Chocolate\" },\r\n                        { \"id\": \"5004\", \"type\": \"Maple\" }\r\n                ]\r\n}\r\n\"\"\"\nConvert to
DataFrame\nAdd the JSON string as a collection type and pass it as an input to\nspark.createDataset\n. This converts it to a DataFrame. The JSON reader infers the schema automatically from the JSON string.\nThis sample code uses a list collection type, which is represented as\njson :: Nil\n. You can also use other Scala collection types, such as Seq (Scala Sequence).\n%scala\r\n\r\nimport org.apache.spark.sql.functions._\r\nimport spark.implicits._\r\nval DF = spark.read.json(spark.createDataset(json :: Nil))\nExtract and flatten\nUse\n$\"column.*\"\nand\nexplode\nmethods to flatten the struct and array types before displaying the flattened DataFrame.\n%scala\r\n\r\ndisplay(DF.select($\"id\" as \"main_id\",$\"name\",$\"batters\",$\"ppu\",explode($\"topping\")) // Exploding the topping column using explode as it is an array type\r\n        .withColumn(\"topping_id\",$\"col.id\") // Extracting topping_id from col using DOT form\r\n        .withColumn(\"topping_type\",$\"col.type\") // Extracting topping_type from col using DOT form\r\n        .drop($\"col\")\r\n        .select($\"*\",$\"batters.*\") // Flattening the struct type batters to the array type batter\r\n        .drop($\"batters\")\r\n        .select($\"*\",explode($\"batter\"))\r\n        .drop($\"batter\")\r\n        .withColumn(\"batter_id\",$\"col.id\") // Extracting batter_id from col using DOT form\r\n        .withColumn(\"batter_type\",$\"col.type\") // Extracting batter_type from col using DOT form\r\n        .drop($\"col\")\r\n       )\nWarning\nMake sure to use $ for all column names, otherwise you may get an error message:\noverloaded method value select with alternatives\n.\nExample notebook\nRun the\nNested JSON to DataFrame example notebook\nto view the sample code and results.
+} \ No newline at end of file diff --git a/scraped_kb_articles/found-duplicate-columns-error-blocks-creation-of-a-delta-table.json b/scraped_kb_articles/found-duplicate-columns-error-blocks-creation-of-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..67adf7ea420c2e14c91573ee428a81d0a315ccdc --- /dev/null +++ b/scraped_kb_articles/found-duplicate-columns-error-blocks-creation-of-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/found-duplicate-columns-error-blocks-creation-of-a-delta-table", + "title": "Unknown Article Title", + "content": "Problem\nYou have an array of struct columns with one or more duplicate column names in a DataFrame.\nIf you try to create a Delta table, you get a\nFound duplicate column(s) in the data to save:\nerror.\nExample code\nYou can reproduce the error with this example code.\n1) The first step sets up an array with duplicate column names. The duplicate columns are identified by comments in the sample code.\n%scala\r\n\r\n// Sample json file to test to_json function\r\nval arrayStructData = Seq(\r\n Row(\"James\",List(Row(\"Java\",\"XX\",120,\"Java\"),Row(\"Scala\",\"XA\",300,\"Scala\"))),\r\n Row(\"Michael\",List(Row(\"Java\",\"XY\",200,\"Java\"),Row(\"Scala\",\"XB\",500,\"Scala\"))),\r\n Row(\"Robert\",List(Row(\"Java\",\"XZ\",400,\"Java\"),Row(\"Scala\",\"XC\",250,\"Scala\")))\r\n )\r\n\r\nimport org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, ArrayType};\r\nval arrayStructSchema = new StructType().add(\"name\",StringType)\r\n   .add(\"booksIntersted\",ArrayType(new StructType()\r\n    .add(\"name\",StringType) // Duplicate column\r\n    .add(\"author\",StringType)\r\n    .add(\"pages\",IntegerType)\r\n    .add(\"Name\",StringType))) // Duplicate column\r\n\r\nval df = spark.createDataFrame(spark.sparkContext\r\n    .parallelize(arrayStructData),arrayStructSchema)\r\n\r\ndf.printSchema // df with struct type\n2) After
validating the DataFrame, we try to create a Delta table and get the\nFound duplicate column(s) in the data to save:\nerror.\n%scala\r\n\r\ndf.createOrReplaceTempView(\"df\")\r\ndf.write.format(\"delta\").save(\"/mnt/delta/test/df_issue\")\r\nspark.sql(\"create table events using delta location '/mnt/delta/test/df_issue'\")\nCause\nAn array of struct columns containing duplicate columns with the same name cannot be present in a Delta table. This is true even if the names are in different cases.\nDelta Lake is case-preserving, but case-insensitive, when storing a schema.\nIn order to avoid potential data corruption or data loss, duplicate column names are not allowed.\nSolution\nThis approach involves converting the parent column, which has duplicate column names, to a JSON string.\n1) You need to convert the\nStructType\ncolumns to\nstring\nusing the\nto_json()\nfunction before creating the Delta table.\n%scala\r\n\r\nimport org.apache.spark.sql.functions.to_json\r\nval df1 = df.select(df(\"name\"), to_json(df(\"booksIntersted\")).alias(\"booksIntersted_string\")) // Use this\r\n\r\ndf1.write.format(\"delta\").save(\"/mnt/delta/df1_solution\")\r\nspark.sql(\"create table events_solution using delta location '/mnt/delta/df1_solution'\")\r\n\r\nspark.sql(\"describe events_solution\").show()\n2) Use the\nget_json_object()\nfunction to extract information from the converted string type column.\n%scala\r\n\r\nimport org.apache.spark.sql.functions.get_json_object\r\nval df2 = df1.select(df1(\"name\"), get_json_object(df1(\"booksIntersted_string\"), \"$[0].author\"), get_json_object(df1(\"booksIntersted_string\"), \"$[0].pages\"))\r\n\r\ndisplay(df2)" +} \ No newline at end of file diff --git a/scraped_kb_articles/from-json-null-spark3.json b/scraped_kb_articles/from-json-null-spark3.json new file mode 100644 index 0000000000000000000000000000000000000000..5ba28081dc204c91a5dd61442ecbb7ee60e04995 --- /dev/null +++ b/scraped_kb_articles/from-json-null-spark3.json @@ -0,0 +1,5 @@ +{ + 
"url": "https://kb.databricks.com/en_US/scala/from-json-null-spark3", + "title": "Unknown Article Title", + "content": "Problem\nThe\nfrom_json\nfunction is used to parse a JSON string and return a struct of values.\nFor example, if you have the JSON string\n[{\"id\":\"001\",\"name\":\"peter\"}]\n, you can pass it to\nfrom_json\nwith a schema and get parsed struct values in return.\n%python\r\n\r\nfrom pyspark.sql.functions import col, from_json\r\ndisplay(\r\n  df.select(col('value'), from_json(col('value'), json_df_schema, {\"mode\" : \"PERMISSIVE\"}))\r\n)\nIn this example, the DataFrame contains a column “value”, with the contents\n[{“id”:”001”,”name”:”peter”}]\nand the schema is\nStructType(List(StructField(id,StringType,true),StructField(name,StringType,true)))\n.\nThis works correctly on Spark 2.4 and below (Databricks Runtime 6.4 ES and below).\n* id:\r\n  \"001\"\r\n* name:\r\n  \"peter\"\nThis returns null values on Spark 3.0 and above (Databricks Runtime 7.3 LTS and above).\n* id:\r\n  null\r\n* name:\r\n  null\nCause\nThis occurs because Spark 3.0 and above cannot parse JSON arrays as structs.\nYou can confirm this by running\nfrom_json\nin\nFAILFAST\nmode.\n%python\r\n\r\nfrom pyspark.sql.functions import col, from_json\r\ndisplay(\r\n  df.select(col('value'), from_json(col('value'), json_df_schema, {\"mode\" : \"FAILFAST\"}))\r\n)\nThis returns an error message that defines the root cause.\nCaused by: RuntimeException: Parsing JSON arrays as structs is forbidden\nSolution\nYou must pass the schema as\nArrayType\ninstead of\nStructType\nin Databricks Runtime 7.3 LTS and above.\n%python\r\n\r\nfrom pyspark.sql.types import StringType, ArrayType, StructType, StructField\r\nschema_spark_3 = ArrayType(StructType([StructField(\"id\",StringType(),True),StructField(\"name\",StringType(),True)]))\r\n\r\n\r\nfrom pyspark.sql.functions import col, from_json\r\ndisplay(\r\n  df.select(col('value'), from_json(col('value'), schema_spark_3, {\"mode\" :
\"PERMISSIVE\"}))\r\n)\nIn this example code, the previous\nStructType\nschema is enclosed in\nArrayType\nand the new schema is used with\nfrom_json\n.\nThis parses the JSON string correctly and returns the expected values." +} \ No newline at end of file diff --git a/scraped_kb_articles/function-ai_similarity-failing-with-unexpected-server-response-error.json b/scraped_kb_articles/function-ai_similarity-failing-with-unexpected-server-response-error.json new file mode 100644 index 0000000000000000000000000000000000000000..070663464cdecf86d8334c3a2d28d880457c74ed --- /dev/null +++ b/scraped_kb_articles/function-ai_similarity-failing-with-unexpected-server-response-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/function-ai_similarity-failing-with-unexpected-server-response-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to use the\nai_similarity\nfunction to calculate the similarity score between two strings, you receive an Unexpected server response error.\nExample code\nSELECT ai_similarity('Apache Spark', NULL);\nExample error message\n[AI_FUNCTION_INVALID_HTTP_RESPONSE] Invalid HTTP response for function `ai_similarity`: Unexpected server response: Length of the left (1) is different from the right (0) SQLSTATE: 08000\nCause\nYou’re passing NULL as an argument, which means you’re providing input data with NULL values to the\nai_similarity\nfunction. The function only takes STRING expressions as arguments.\nSolution\nFilter out all NULL values from the input data. For details, refer to the\nai_similarity\nfunction\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
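The root cause in the from-json-null-spark3 article above — a JSON array is not a struct — can be seen with nothing more than Python's standard json module. This is an illustrative sketch, not Spark code:

```python
import json

# The column value from the from_json article: a JSON *array* of objects.
value = '[{"id":"001","name":"peter"}]'
parsed = json.loads(value)

# The top-level value is a list (array), not a dict (struct), which is why
# Spark 3.0 and above requires an ArrayType schema instead of a bare StructType.
print(type(parsed).__name__)  # list
print(parsed[0]["id"])        # 001
```

Since the outermost JSON value is an array, a schema that describes only a struct has nothing to bind to, and Spark 3.x returns nulls (or fails in FAILFAST mode) rather than silently unwrapping the array.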
+} \ No newline at end of file diff --git a/scraped_kb_articles/function-object-no-attribute.json b/scraped_kb_articles/function-object-no-attribute.json new file mode 100644 index 0000000000000000000000000000000000000000..686435567c9d56f24ae828933bb5762de62ac3f5 --- /dev/null +++ b/scraped_kb_articles/function-object-no-attribute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/function-object-no-attribute", + "title": "Unknown Article Title", + "content": "Problem\nYou are selecting columns from a DataFrame and you get an error message.\nERROR: AttributeError: 'function' object has no attribute '_get_object_id' in job\nCause\nThe DataFrame API contains a small number of protected keywords.\nIf a column in your DataFrame uses a protected keyword as the column name, you will get an error message.\nFor example, summary is a protected keyword. If you use summary as a column name, you will see the error message.\nThis sample code uses summary as a column name and generates the error message when run.\n%python\r\n\r\ndf=spark.createDataFrame([1,2], \"int\").toDF(\"id\")\r\ndf.show()\r\nfrom pyspark.sql.types import StructType,StructField, StringType, IntegerType\r\n\r\ndf1 = spark.createDataFrame(\r\n  [(10,), (11,), (13,)],\r\n  StructType([StructField(\"summary\", IntegerType(), True)]))\r\n\r\ndf1.show()\r\n\r\nResultDf = df1.join(df, df1.summary == df.id, \"inner\").select(df.id,df1.summary)\r\nResultDf.show()\nSolution\nYou should not use\nDataFrame API protected keywords\nas column names.\nIf you must use protected keywords, you should use bracket-based column access when selecting columns from a DataFrame.
Do not use dot notation when selecting columns that use protected keywords.\n%python\r\n\r\nResultDf = df1.join(df, df1[\"summary\"] == df.id, \"inner\").select(df.id,df1[\"summary\"])" +} \ No newline at end of file diff --git a/scraped_kb_articles/ganglia-metrics-not-appearing-in-historical-metrics-snapshots-list.json b/scraped_kb_articles/ganglia-metrics-not-appearing-in-historical-metrics-snapshots-list.json new file mode 100644 index 0000000000000000000000000000000000000000..0778fea9d09ed3815f548230312617e9cea2d1b3 --- /dev/null +++ b/scraped_kb_articles/ganglia-metrics-not-appearing-in-historical-metrics-snapshots-list.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/ganglia-metrics-not-appearing-in-historical-metrics-snapshots-list", + "title": "Unknown Article Title", + "content": "Problem\nWhen you navigate to your\nMetrics\ntab to access the Ganglia metrics for a cluster, you notice you have zero files in your\nHistorical metrics snapshots\nlist. Within the file list, you see a message, “No metrics found.”\nCause\nGanglia snapshots are taken every 15 minutes. If the cluster runs for less than 15 minutes and terminates before a snapshot is taken, Ganglia metrics are not collected.\nNote\nGanglia metrics\nare only available for Databricks Runtime 12.2 LTS and below for AWS and Azure. Ganglia is not supported on Databricks on Google Cloud.\nSolution\nDatabricks recommends using the\nCompute metrics\nfeature introduced as of Databricks Runtime 13.0, where metrics are collected every minute.\nFor more information, please review the\nManage compute\n(\nAWS\n|\nAzure\n)\ndocumentation.\nThe new compute metrics UI has a more comprehensive view of your cluster’s resource usage, including Spark consumption and internal Databricks processes. In contrast, the Ganglia UI only measures Spark container consumption.
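The dot-versus-bracket distinction from the function-object-no-attribute article above can be mimicked with a toy Python class. TinyFrame is a hypothetical stand-in for illustration only, not PySpark itself:

```python
class TinyFrame:
    """Toy stand-in for a DataFrame-like object (not PySpark)."""
    def __init__(self, columns):
        self._columns = columns

    def summary(self):
        # An existing method, analogous to DataFrame.summary in the article.
        return "statistics..."

    def __getitem__(self, name):
        # Bracket access looks the name up in the column data instead.
        return self._columns[name]

df1 = TinyFrame({"summary": [10, 11, 13]})
print(callable(df1.summary))  # True: dot access resolves to the method
print(df1["summary"])         # [10, 11, 13]: bracket access reaches the column
```

Dot access goes through normal attribute lookup, so an existing method named `summary` shadows the column of the same name; bracket access bypasses attribute lookup entirely, which is why it is the safe way to select columns whose names collide with protected keywords.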
For additional information about how to use these new metrics please refer to the\nView compute metrics\n(\nAWS\n|\nAzure\n)\ndocumentation.\nIf you need to continue using Databricks Runtime 9.1 LTS - 12.2 LTS, configure the collection period using the cluster UI:\nSet the\nDATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES\nenvironment variable.\nAdvanced Options\n>\nSpark\n>\nEnvironment Variables.\nAdd the required time.\nDATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES=5\nAlternatively, use the API to set the required time using\nthe\nspark_env_vars\nfield. For more information, please review the\nClusters API\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/gdal-library-installation.json b/scraped_kb_articles/gdal-library-installation.json new file mode 100644 index 0000000000000000000000000000000000000000..f98c1f1789cbbcc78751de5d94a8c59cc7a45c4e --- /dev/null +++ b/scraped_kb_articles/gdal-library-installation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/gdal-library-installation", + "title": "Unknown Article Title", + "content": "Problem\nWhen you attempt to install the GDAL library on a cluster using an init script, the cluster doesn’t start and on cluster events you see an error message.\n{\r\n\"reason\": {\r\n\"code\": \"INIT_SCRIPT_FAILURE\",\r\n\"type\": \"CLIENT_ERROR\",\r\n\"parameters\": {\r\n\"instance_id\": \"\",\r\n**\"databricks_error_message\": \"Cluster scoped init script failed: Script exit status is non-zero\"**\r\n}\r\n}\r\n}\nYou may also get an error about failing to reach a Ubuntu website.\nFailed to fetch http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease Cannot initiate the connection to security.ubuntu.com:80 (2620:2d:4002:1::103).
- connect (101: Network is unreachable).\nCause\nThe first error message can be caused by a misconfigured init script.\nThe error message about not reaching the Ubuntu website can happen if the package repository only uses http by default.\napt-get\ncan use HTTPS but the repository must support HTTPS/SSL for this to work. The main package repository at archive.ubuntu.com does not. If\napt-get\ntries to connect via HTTPS and it is not supported, the connection fails with an error.\nSolution\nThese steps resolve most GDAL installation issues.\n1. Verify that you are using Databricks Runtime version 13.3 LTS or above.\n2. Make sure you are following the\nofficial Mosaic GDAL installation guide\n. This guide provides an up-to-date method for installing GDAL on Databricks clusters.\n3. Create the init script based on the instructions in the Mosaic guide. The guide includes a Python function\nsetup_gdal()\nthat generates an init script tailored to your environment.\nTo generate the init script run the following code:\n# First install mosaic\r\n\r\n%pip install databricks-mosaic\r\n\r\n# Then in another cell import the library and run the setup_gdal() function\r\n\r\n%python\r\n\r\nimport mosaic as mos\r\nmos.enable_mosaic(spark, dbutils)\r\nmos.setup_gdal()\nThis is an example init script generated with\nsetup_gdal()\n.\n#!/bin/bash\r\n# --\r\n# This is for Ubuntu 22.04 (Jammy)\r\n# [1] corresponds to DBR 13+\r\n# [2] jammy offers GDAL 3.4.1\r\n# [3] see Mosaic functions (python) to configure\r\n# and pre-stage resources:\r\n# - setup_fuse_install(...) 
and\r\n# - setup_gdal(...)\r\n# [4] this script has conditional logic based on variables\r\n# [5] stripped back in Mosaic 0.4.2+\r\n# Author: Michael Johns | mjohns@databricks.com\r\n# Last Modified: 29 APR, 2024\r\n\r\n# TEMPLATE-BASED REPLACEMENT\r\n# - can also be manually specified\r\nFUSE_DIR='/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2'\r\n\r\n# CONDITIONAL LOGIC\r\nWITH_FUSE_SO=0 # <- use fuse dir shared objects (vs wget)\r\n\r\n# refresh package info\r\n# 0.4.2 - added \"-y\"\r\nsudo apt-add-repository -y \"deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-backports main universe multiverse restricted\"\r\nsudo apt-add-repository -y \"deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-updates main universe multiverse restricted\"\r\nsudo apt-add-repository -y \"deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-security main multiverse restricted universe\"\r\nsudo apt-add-repository -y \"deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) main multiverse restricted universe\"\r\nsudo apt-get update -y\r\n\r\n# install natives\r\n# 0.4.2 added package lock wait (can change value)\r\nsudo apt-get -o DPkg::Lock::Timeout=-1 install -y unixodbc libcurl3-gnutls libsnappy-dev libopenjp2-7\r\nsudo apt-get -o DPkg::Lock::Timeout=-1 install -y gdal-bin libgdal-dev python3-numpy python3-gdal\r\n\r\n# pip install gdal\r\n# matches jammy version\r\npip install --upgrade pip\r\npip install gdal==3.4.1\r\n\r\n# add pre-build JNI shared object to the path\r\nif [ $WITH_FUSE_SO == 1 ]\r\nthen\r\n# copy from fuse dir with no-clobber\r\nsudo cp -n $FUSE_DIR/libgdalalljni.so /usr/lib\r\nsudo cp -n $FUSE_DIR/libgdalalljni.so.30 /usr/lib\r\nsudo cp -n $FUSE_DIR/libgdalalljni.so.30.0.3 /usr/lib\r\nelse\r\n# copy from github\r\nGITHUB_REPO_PATH=databrickslabs/mosaic/main/resources/gdal/jammy\r\nsudo wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so\r\nsudo wget -nv -P /usr/lib -nc 
https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30\r\nsudo wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30.0.3\r\nfi\n4. Install the GDAL init script as a\ncluster-scoped init script\n(\nAWS\n|\nAzure\n|\nGCP\n).\n5. If you are seeing any library incompatibilities, review any\nglobal init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) that are enabled in your workspace and ensure they are not causing a conflict.\nInfo\nYou must be a workspace admin to create or see global init scripts. Be careful when editing any global init script as it impacts all workloads running in the workspace.\n6. If the GDAL init script is failing due to\narchive.ubuntu.com/ubuntu\nbeing unreachable, you need to use an HTTPS-enabled mirror. You will need to update the GDAL init script, replacing\narchive.ubuntu.com/ubuntu\nwith the address of the mirror you want to use.\nPartial mirror list\nAsia\nhttps://ftp.sjtu.edu.cn/ubuntu/\nhttps://mirrors.tuna.tsinghua.edu.cn/ubuntu/\nEurope\nhttps://mirror.vorboss.net/ubuntu-archive/\nhttps://mirror.plusserver.com/ubuntu/ubuntu/\nUnited States\nhttps://lug.mtu.edu/ubuntu/\nhttps://mirrors.bloomu.edu/ubuntu/\nhttps://mirrors.ocf.berkeley.edu/ubuntu/\nhttps://mirror.umd.edu/ubuntu/\nhttps://mirrors.xmission.com/ubuntu/" +} \ No newline at end of file diff --git a/scraped_kb_articles/gen-unique-increasing-values.json b/scraped_kb_articles/gen-unique-increasing-values.json new file mode 100644 index 0000000000000000000000000000000000000000..1b6246c2365f34b100c83afd3e03514a19cad13f --- /dev/null +++ b/scraped_kb_articles/gen-unique-increasing-values.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/gen-unique-increasing-values", + "title": "Unknown Article Title", + "content": "This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column.\nWe review three different methods to use.
You should select the method that works best with your use case.\nUse\nzipWithIndex()\nin a Resilient Distributed Dataset (RDD)\nThe\nzipWithIndex()\nfunction is only available within RDDs. You cannot use it directly on a DataFrame.\nConvert your DataFrame to an RDD, apply\nzipWithIndex()\nto your data, and then convert the RDD back to a DataFrame.\nWe are going to use the following example code to add unique id numbers to a basic table with two entries.\n%python\r\n\r\nfrom pyspark.sql.functions import col\r\n\r\ndf = spark.createDataFrame(\r\n    [\r\n        ('Alice','10'),('Susan','12')\r\n    ],\r\n    ['Name','Age']\r\n)\r\n\r\n\r\ndf1=df.rdd.zipWithIndex().toDF()\r\ndf2=df1.select(col(\"_1.*\"),col(\"_2\").alias('increasing_id'))\r\ndf2.show()\nRun the example code and we get the following results:\n+-----+---+-------------+\r\n| Name|Age|increasing_id|\r\n+-----+---+-------------+\r\n|Alice| 10|            0|\r\n|Susan| 12|            1|\r\n+-----+---+-------------+\nUse\nmonotonically_increasing_id()\nfor unique, but not consecutive numbers\nThe\nmonotonically_increasing_id()\nfunction generates monotonically increasing 64-bit integers.\nThe generated id numbers are guaranteed to be increasing and unique, but they are not guaranteed to be consecutive.\nWe are going to use the following example code to add monotonically increasing id numbers to a basic table with two entries.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\n\r\ndf_with_increasing_id = df.withColumn(\"monotonically_increasing_id\", monotonically_increasing_id())\r\ndf_with_increasing_id.show()\nRun the example code and we get the following results:\n+-----+---+---------------------------+\r\n| Name|Age|monotonically_increasing_id|\r\n+-----+---+---------------------------+\r\n|Alice| 10|                 8589934592|\r\n|Susan| 12|                25769803776|\r\n+-----+---+---------------------------+\nCombine\nmonotonically_increasing_id()\nwith\nrow_number()\nfor two columns\nThe\nrow_number()\nfunction generates numbers that are
consecutive.\nCombine this with\nmonotonically_increasing_id()\nto generate two columns of numbers that can be used to identify data entries.\nWe are going to use the following example code to add monotonically increasing id numbers and row numbers to a basic table with two entries.\n%python\r\n\r\nfrom pyspark.sql.functions import *\r\nfrom pyspark.sql.window import *\r\n\r\nwindow = Window.orderBy(col('monotonically_increasing_id'))\r\ndf_with_consecutive_increasing_id = df_with_increasing_id.withColumn('increasing_id', row_number().over(window))\r\ndf_with_consecutive_increasing_id.show()\nRun the example code and we get the following results:\n+-----+---+---------------------------+-------------+\r\n| Name|Age|monotonically_increasing_id|increasing_id|\r\n+-----+---+---------------------------+-------------+\r\n|Alice| 10|                 8589934592|            1|\r\n|Susan| 12|                25769803776|            2|\r\n+-----+---+---------------------------+-------------+\nIf you need to increment based on the last updated maximum value, you can define a previous maximum value and then start counting from there.\nWe’re going to build on the example code that we just ran.\nFirst, we need to define the value of\nprevious_max_value\n. You would normally do this by fetching the value from your existing output table. 
For this example, we are going to define it as 1000.\n%python\r\n\r\nprevious_max_value = 1000\r\ndf_with_consecutive_increasing_id.withColumn(\"cnsecutiv_increase\", col(\"increasing_id\") + lit(previous_max_value)).show()\nWhen this is combined with the previous example code and run, we get the following results:\n+-----+---+---------------------------+-------------+------------------+\r\n| Name|Age|monotonically_increasing_id|increasing_id|cnsecutiv_increase|\r\n+-----+---+---------------------------+-------------+------------------+\r\n|Alice| 10|                 8589934592|            1|              1001|\r\n|Susan| 12|                25769803776|            2|              1002|\r\n+-----+---+---------------------------+-------------+------------------+" +} \ No newline at end of file diff --git a/scraped_kb_articles/generate-a-list-of-all-workspace-admins.json b/scraped_kb_articles/generate-a-list-of-all-workspace-admins.json new file mode 100644 index 0000000000000000000000000000000000000000..caa21930af06186efc4bf80d1cd4a61f6fa129ed --- /dev/null +++ b/scraped_kb_articles/generate-a-list-of-all-workspace-admins.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/generate-a-list-of-all-workspace-admins", + "title": "Unknown Article Title", + "content": "Workspace administrators have full privileges to manage a workspace. This includes adding and removing users, as well as managing all of the data resources (jobs, libraries, notebooks, repos, etc.)
in the workspace.\nInfo\nYou must be a workspace administrator to perform the steps detailed in this article.\nIf you are a workspace admin, you can view other workspace admins in the workspace UI by using\nAdmin Settings\n.\nClick your username in the top bar of the workspace.\nClick\nAdmin Settings\n.\nClick\nGroups\n.\nClick the\nadmins\ngroup.\nReview the list of workspace admins.\nWhile the UI method works for most situations, in certain use cases you may need to get a list of workspace admins from within a notebook.\nThis article provides sample code that you can use to list all of the workspace admins in the current workspace.\nInstructions\nWe can use the following code to fetch the workspace admin users.\nInfo\nTo get your workspace URL, review\nWorkspace instance names, URLs, and IDs\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nGenerate a personal access token\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details on how to create a personal access token for use with the REST APIs.\nCopy and paste the sample code into a notebook cell.\nReplace the\n\nand\n\nvalues with ones specific to your workspace.\nRun the cell to generate a list of all the admin users in your workspace.\n%python\r\n\r\nimport requests\r\naccess_token= \"\"\r\n\r\nheaders={\r\n 'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\nresponse = requests.get(f\"https:///api/2.0/preview/scim/v2/Groups\", headers=headers) \r\n\r\ngroups = response.json()\r\nfor admin in groups['Resources']:\r\n if admin.get('displayName')== 'admins':\r\n admin_group_id=admin.get('id')\r\n print(admin_group_id)\r\n\r\nresponse = requests.get(f\"https:///api/2.0/preview/scim/v2/Groups/{admin_group_id}\", headers=headers) \r\nadmin_group = response.json() \r\nmembers = admin_group.get('members', []) \r\nadmin_users = [] \r\nfor member in members: \r\n admin_users.append(member.get('display'))\r\n\r\nadmin_users.sort()\r\nprint(admin_users)" +} \ No newline at end of file diff --git
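The group-lookup loop in the notebook code above can be factored into a small, testable function. The payload below is a hypothetical, trimmed-down SCIM Groups response for illustration; a real response contains more fields.

```python
def find_admin_group_id(groups_response):
    # Walk the SCIM /Groups payload and return the id of the 'admins' group,
    # mirroring the lookup loop in the notebook code above.
    for group in groups_response.get("Resources", []):
        if group.get("displayName") == "admins":
            return group.get("id")
    return None

# Hypothetical response shape (illustrative values only).
sample = {"Resources": [{"displayName": "data-engineers", "id": "101"},
                        {"displayName": "admins", "id": "202"}]}
print(find_admin_group_id(sample))  # 202
```

Separating the parsing logic from the HTTP calls makes it easy to unit-test the lookup without a live workspace or access token.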
a/scraped_kb_articles/generate-browser-har-files.json b/scraped_kb_articles/generate-browser-har-files.json new file mode 100644 index 0000000000000000000000000000000000000000..ff63ff3f06d69cbae721bd0d384b47ecfb1cb5d2 --- /dev/null +++ b/scraped_kb_articles/generate-browser-har-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/generate-browser-har-files", + "title": "Unknown Article Title", + "content": "When troubleshooting UI issues, it is sometimes necessary to obtain additional information about the network requests that are generated in your browser. If this is needed, our support team will ask you to generate a HAR file.\nThis article describes how to generate a HAR file with each of the major web browsers.\nWarning\nHAR files contain sensitive data, including the content of the pages you download while recording. This can include your browser cookies, session tokens, and private keys, which can grant access to your accounts.\nYou should sanitize all sensitive data (including cookies, credentials, and tokens) before sharing a HAR file with a trusted partner. You should NEVER share a HAR file publicly.\nChrome\nOpen Google Chrome and navigate to the page you want to record.\nRight-click on the page, and then click\nInspect\n.\nClick the\nNetwork\ntab.\nLook for the\nRecord\nbutton (\n) in the upper left corner of the frame. It should be red. If the\nRecord\nbutton is grey, click it once to start recording.\nCheck the\nPreserve log\nbox.\nClick the\nClear\nbutton (\n).
This removes any existing logs from the tab.\nReproduce the issue while the network requests are being recorded.\nYou will see session output in the frame.\nOnce you have reproduced the issue, click the\nExport HAR\nbutton (\n).\nYou are prompted to save the file on your computer.\nSave the HAR file.\nAttach the HAR file to your ticket.\nEdge\nOpen Microsoft Edge and navigate to the page you want to record.\nRight-click on the page, and then click\nInspect\n.\nClick the\nNetwork\ntab.\nLook for the\nRecord\nbutton (\n) in the upper left corner of the frame. It should be red. If the\nRecord\nbutton is grey, click it once to start recording.\nCheck the\nPreserve log\nbox.\nClick the\nClear\nbutton (\n). This removes any existing logs from the tab.\nReproduce the issue while the network requests are being recorded.\nYou will see session output in the frame.\nOnce you have reproduced the issue, click the Export HAR button (\n).\nYou are prompted to save the file on your computer.\nSave the HAR file.\nAttach the HAR file to your ticket.\nFirefox\nOpen Firefox and navigate to the page you want to record.\nLook for the Firefox menu in the top-right.\nClick\nMore Tools\n.\nClick\nWeb Developer Tools\n.\nThe\nDeveloper Network Tools\npanel opens.\nClick the\nNetwork\ntab.\nStart performing actions in the browser. Recording starts automatically.\nOnce you have reproduced the issue, right-click on the gear (⚙).\nClick\nSave all as HAR\n.\nYou are prompted to save the file on your computer.\nSave the HAR file.\nAttach the HAR file to your ticket.\nSafari\nYou need to enable the\nDevelop Menu\nin Safari before you can access the developer console.\nClick\nSafari\nin the menu bar.\nClick\nPreferences\n.\nClick the\nAdvanced\ntab.\nSelect\nShow Develop menu in menu bar\n.\nClick\nDevelop\nin the menu bar.\nClick\nShow Web Inspector\n.\nClick the\nNetwork\ntab.\nStart performing actions in the browser.
Recording starts automatically.\nOnce you have reproduced the issue, click\nExport.\nYou are prompted to save the file on your computer.\nSave the HAR file.\nAttach the HAR file to your ticket." +} \ No newline at end of file diff --git a/scraped_kb_articles/geospark-undefined-function-error-dbconnect.json b/scraped_kb_articles/geospark-undefined-function-error-dbconnect.json new file mode 100644 index 0000000000000000000000000000000000000000..7fff5813a1a3b7f5752c6f76672ce42aadefa2b6 --- /dev/null +++ b/scraped_kb_articles/geospark-undefined-function-error-dbconnect.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/geospark-undefined-function-error-dbconnect", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to use the GeoSpark function\nst_geomfromwkt\nwith DBConnect (\nAWS\n|\nAzure\n|\nGCP\n) and you get an Apache Spark error message.\nError: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.;\nThis example code fails with the error when used with DBConnect.\n%scala\r\n\r\nval sc = spark.sparkContext\r\nsc.setLogLevel(\"DEBUG\")\r\n\r\nval sqlContext = spark.sqlContext\r\nspark.sparkContext.addJar(\"~/jars/geospark-sql_2.3-1.2.0.jar\")\r\nspark.sparkContext.addJar(\"~/jars/geospark-1.2.0.jar\")\r\n\r\nGeoSparkSQLRegistrator.registerAll(sqlContext)\r\nprintln(spark.sessionState.functionRegistry.listFunction)\r\n\r\nspark.sql(\"select ST_GeomFromWKT(area) AS geometry from polygon\").show()\nCause\nDBConnect does not support auto-sync of client-side UDFs to the server.\nSolution\nYou can use a custom utility jar with code that registers the UDF on the cluster using the\nSparkSessionExtensions\nclass.\nCreate a utility jar that registers GeoSpark functions using\nSparkSessionExtensions\n.
This utility class definition can be built into a utility jar.\n%scala\r\n\r\npackage com.databricks.spark.utils\r\n\r\nimport org.apache.spark.sql.SparkSessionExtensions\r\nimport org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator\r\n\r\nclass GeoSparkUdfExtension extends (SparkSessionExtensions => Unit) {\r\n  def apply(e: SparkSessionExtensions): Unit = {\r\n    e.injectCheckRule(spark => {\r\n      println(\"INJECTING UDF\")\r\n      GeoSparkSQLRegistrator.registerAll(spark)\r\n      _ => Unit\r\n    })\r\n  }\r\n}\nCopy the GeoSpark jars and your utility jar to DBFS at\ndbfs:/databricks/geospark-extension-jars/\n.\nCreate an init script (\nset_geospark_extension_jar.sh\n) that copies the jars from the DBFS location to the Spark class path and sets the\nspark.sql.extensions\nto the utility class.\n%scala\r\n\r\ndbutils.fs.put(\r\n    \"dbfs:/databricks//set_geospark_extension_jar.sh\",\r\n    \"\"\"#!/bin/sh\r\n      |sleep 10s\r\n      |# Copy the extension and GeoSpark dependency jars to /databricks/jars.\r\n      |cp -v /dbfs/databricks/geospark-extension-jars/{spark_geospark_extension_2_11_0_1.jar,geospark_sql_2_3_1_2_0.jar,geospark_1_2_0.jar} /databricks/jars/\r\n      |# Set the extension.\r\n      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf\r\n      |[driver] {\r\n      |    \"spark.sql.extensions\" = \"com.databricks.spark.utils.GeoSparkUdfExtension\"\r\n      |}\r\n      |EOF\r\n      |\"\"\".stripMargin,\r\n    overwrite = true\r\n)\nInstall the init script as a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n). You will need the full path to the location of the script (\ndbfs:/databricks//set_geospark_extension_jar.sh\n).\nReboot your cluster.\nYou can now use GeoSpark code with DBConnect." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/get-file-path-auto-loader.json b/scraped_kb_articles/get-file-path-auto-loader.json new file mode 100644 index 0000000000000000000000000000000000000000..413672395189a008406ce0fdab12141692c16150 --- /dev/null +++ b/scraped_kb_articles/get-file-path-auto-loader.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/get-file-path-auto-loader", + "title": "Unknown Article Title", + "content": "When you process streaming files with Auto Loader (\nAWS\n|\nAzure\n|\nGCP\n), events are logged based on the files created in the underlying storage.\nThis article shows you how to add the file path for every filename to a new column in the output DataFrame.\nOne use case for this is auditing. When files are ingested to a partitioned folder structure there is often useful metadata, such as the timestamp, which can be extracted from the path for auditing purposes.\nFor example, assume a file path and filename of\n2020/2021-01-01/file1_T191634.csv\n.\nFrom this path you can apply custom UDFs and use regular expressions to extract details like the date (2021-01-01) and the timestamp (T191634).\nThe following example code uses\ninput_file_name()\nto get the path and filename for every row and write it to a new column named\nfilePath\n.\n%scala\r\n\r\nval df = spark.readStream.format(\"cloudFiles\")\r\n  .schema(schema)\r\n  .option(\"cloudFiles.format\", \"csv\")\r\n  .option(\"cloudFiles.region\",\"ap-south-1\")\r\n  .load(\"path\")\r\n  .withColumn(\"filePath\",input_file_name())" +} \ No newline at end of file diff --git a/scraped_kb_articles/get-last-modification-time-for-all-files-in-auto-loader-and-batch-jobs.json b/scraped_kb_articles/get-last-modification-time-for-all-files-in-auto-loader-and-batch-jobs.json new file mode 100644 index 0000000000000000000000000000000000000000..7730ce1ad686c6e71043b73a36b2613730d145c4 --- /dev/null +++
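The regular-expression extraction mentioned in the get-file-path-auto-loader article above (pulling 2021-01-01 and T191634 out of the path) can be sketched in plain Python before wrapping it in a Spark UDF. The patterns assume the partitioned path layout shown in the example; a different layout would need different expressions.

```python
import re

def extract_date_and_time(file_path):
    # Pull the date and the T-prefixed time token out of a partitioned path
    # such as 2020/2021-01-01/file1_T191634.csv.
    date_match = re.search(r"\d{4}-\d{2}-\d{2}", file_path)
    time_match = re.search(r"T\d{6}", file_path)
    return (date_match.group(0) if date_match else None,
            time_match.group(0) if time_match else None)

print(extract_date_and_time("2020/2021-01-01/file1_T191634.csv"))  # ('2021-01-01', 'T191634')
```

Returning `None` when a token is missing keeps the function total, which matters once it runs inside a UDF over arbitrary file paths.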
b/scraped_kb_articles/get-last-modification-time-for-all-files-in-auto-loader-and-batch-jobs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/get-last-modification-time-for-all-files-in-auto-loader-and-batch-jobs", + "title": "Unknown Article Title", + "content": "You are running a streaming job with Auto Loader (\nAWS\n|\nAzure\n|\nGCP\n) and want to get the last modification time for each file from the storage account.\nInstructions\nThe\nGet the path of files consumed by Auto Loader\narticle describes how to get the filenames and paths for all files consumed by the Auto Loader. In this article, we build on that foundation and use sample code to show you how to apply a custom UDF and then extract the last modification time for the file.\nStart out by defining your imports and variables. You need to define the\n\n, as well as the\n\n, and\n\nyou are using.\nimport org.apache.hadoop.conf.Configuration\r\nimport org.apache.hadoop.fs.{FileStatus, FileSystem, Path}\r\nimport org.apache.spark.sql.functions.{input_file_name, col, udf, from_unixtime}\r\nimport org.apache.spark.sql.types._\r\nval basePath = \"\"\r\nval inputLocation = basePath + \"\"\r\nval outputLocation = basePath + \"\"\nFor this example, we need to generate sample data and store it in a DataFrame.
In a practical use case, you would be reading data from your storage bucket.\nimport org.apache.spark.sql.types._\r\n\r\nval sampleData = Seq(\r\n Row(1, \"James\", 10, \"M\", 1000),\r\n Row(1, \"Michael\", 20, \"F\", 2000),\r\n Row(2, \"Robert\", 30, \"M\", 3000),\r\n Row(2, \"Maria\", 40, \"F\", 4000),\r\n Row(3, \"Jen\", 50, \"M\", 5000)\r\n )\r\n\r\nval sampleSchema = StructType(Array(\r\n StructField(\"id\", IntegerType, true),\r\n StructField(\"name\", StringType, true),\r\n StructField(\"age\", IntegerType, true),\r\n StructField(\"gender\", StringType, true),\r\n StructField(\"salary\", IntegerType, true)\r\n ))\r\n\r\nval df = spark.createDataFrame(sc.parallelize(sampleData), sampleSchema)\r\ndf.coalesce(1).write.format(\"parquet\").partitionBy(\"id\", \"age\").mode(\"append\").save(inputLocation);\r\nspark.read.format(\"parquet\").load(inputLocation).count();\nCreate a custom UDF to list all files in the storage path and return the last modification time for each file.\nval getModificationTimeUDF = udf((path: String) => {\r\n  val conf = new Configuration() // Create the Hadoop configuration inside the UDF so it is built on the executor\r\n  val finalPath = new Path(path)\r\n  val fs = finalPath.getFileSystem(conf)\r\n  if(fs.exists(finalPath)) {fs.listStatus(new Path(path)).head.getModificationTime}\r\n  else {-1 // Or some other value based on business decision\r\n       }\r\n})\nApply the UDF to the batch job. The UDF returns each file's last modification time in UNIX time format.
To convert this into a human-readable format divide by 1000 and then cast it as the\ntimestamp\n.\nval df = spark.read.format(\"parquet\").load(inputLocation)\r\n.withColumn(\"filePath\", input_file_name())\r\n.withColumn(\"fileModificationTime\", getModificationTimeUDF(col(\"filePath\")))\r\n.withColumn(\"fileModificationTimestamp\", from_unixtime($\"fileModificationTime\" / 1000, \"yyyy-MM-dd HH:mm:ss\").cast(TimestampType).as(\"timestamp\")).drop(\"fileModificationTime\")\r\ndisplay(df)\nApply the UDF to the Auto Loader streaming job.\nval sdf = spark.readStream.format(\"cloudFiles\")\r\n.schema(sampleSchema)\r\n.option(\"cloudFiles.format\", \"parquet\")\r\n.option(\"cloudFiles.includeExistingFiles\", \"true\")\r\n.option(\"cloudFiles.connectionString\", connectionString)\r\n.option(\"cloudFiles.resourceGroup\", resourceGroup)\r\n.option(\"cloudFiles.subscriptionId\", subscriptionId)\r\n.option(\"cloudFiles.tenantId\", tenantId)\r\n.option(\"cloudFiles.clientId\", clientId)\r\n.option(\"cloudFiles.clientSecret\", clientSecret)\r\n.option(\"cloudFiles.useNotifications\", \"true\")\r\n.load(inputLocation)\r\n.withColumn(\"filePath\", input_file_name())\r\n.withColumn(\"fileModificationTime\", getModificationTimeUDF(col(\"filePath\")))\r\n.withColumn(\"fileModificationTimestamp\", from_unixtime($\"fileModificationTime\" / 1000, \"yyyy-MM-dd HH:mm:ss\").cast(TimestampType).as(\"timestamp\"))\r\n.drop(\"fileModificationTime\")\r\n\r\ndisplay(sdf)\nTo recap,\ninput_file_name()\nis used to read an absolute file path, including the file name. We then created a custom UDF to list all files from the storage path. You can get the file's last modification time from each file but it is listed in UNIX time format. Convert the UNIX time format into a readable format by dividing UNIX time by 1000 and converting it to a timestamp." 
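The epoch-milliseconds conversion the article's Scala code performs with `from_unixtime($"fileModificationTime" / 1000, ...)` can be sanity-checked in plain Python. This is an illustrative sketch added here, not part of the original article; the function name is made up.

```python
from datetime import datetime, timezone

def modification_time_to_string(ms: int) -> str:
    # Hadoop's getModificationTime returns milliseconds since the epoch,
    # so divide by 1000 to get seconds before converting, just as the
    # article's from_unixtime($"fileModificationTime" / 1000) expression does.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(modification_time_to_string(1_700_000_000_000))  # 2023-11-14 22:13:20
```

Forgetting the division by 1000 is a common mistake: the raw millisecond value would be interpreted as seconds and produce a date tens of thousands of years in the future.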
+} \ No newline at end of file diff --git a/scraped_kb_articles/get-notebooks-deleted-user.json b/scraped_kb_articles/get-notebooks-deleted-user.json new file mode 100644 index 0000000000000000000000000000000000000000..a14b0361ccc14eefa14a6e712858745f619ea9ea --- /dev/null +++ b/scraped_kb_articles/get-notebooks-deleted-user.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/get-notebooks-deleted-user", + "title": "Título do Artigo Desconhecido", + "content": "When you remove a user (\nAWS\n|\nAzure\n) from Databricks, a special backup folder is created in the workspace. This backup folder contains all of the deleted user’s content.\nBackup folders appear in the workspace as\n-backup-#\n.\nDelete\nInfo\nOnly an admin user can access a backup folder.\nTo access a backup folder:\nLog into Databricks as an admin user.\nSelect\nWorkspace\nfrom the sidebar.\nSelect\nUsers\n.\nSelect the backup folder.\nYou can delete the backup folder once it is no longer required." +} \ No newline at end of file diff --git a/scraped_kb_articles/get-spark-config-in-dbconnect.json b/scraped_kb_articles/get-spark-config-in-dbconnect.json new file mode 100644 index 0000000000000000000000000000000000000000..4f43f384154c345d624ff4c9cffe2c516b03efed --- /dev/null +++ b/scraped_kb_articles/get-spark-config-in-dbconnect.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/get-spark-config-in-dbconnect", + "title": "Título do Artigo Desconhecido", + "content": "You can always view the\nSpark configuration\n(\nAWS\n|\nAzure\n|\nGCP\n) for your cluster by reviewing the cluster details in the workspace.\nIf you are using DBConnect (\nAWS\n|\nAzure\n|\nGCP\n) you may want to quickly review the current\nSpark configuration\ndetails without switching over to the workspace UI.\nThis example code shows you how to get the current\nSpark configuration\nfor your cluster by making a REST API call in DBConnect.\n%python\r\n\r\nimport json\r\nimport 
requests\r\nimport base64\r\n\r\nwith open(\"//.databricks-connect\") as readconfig:\r\n    conf = json.load(readconfig)\r\n\r\nCLUSTER_ID = conf[\"cluster_id\"]\r\nTOKEN = conf[\"token\"]\r\nAPI_URL = conf[\"host\"]\r\n\r\nheaders = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + TOKEN}\r\npayload =  {'cluster_id': '' + CLUSTER_ID}\r\nresponse = requests.get(API_URL + \"/api/2.0/clusters/get/?cluster_id=\"+CLUSTER_ID, headers=headers, json = payload)\r\nsparkconf = response.json()[\"spark_conf\"]\r\n\r\nfor config_key, config_value in sparkconf.items():\r\n    print(config_key, config_value)\nDelete\nWarning\nDBConnect only works with supported Databricks Runtime versions. Ensure that you are using a supported runtime on your cluster before using DBConnect." +} \ No newline at end of file diff --git a/scraped_kb_articles/get-workspace-configuration-details.json b/scraped_kb_articles/get-workspace-configuration-details.json new file mode 100644 index 0000000000000000000000000000000000000000..3d81e687f1f08bd56ad0ac58ebb0710a3b52edf0 --- /dev/null +++ b/scraped_kb_articles/get-workspace-configuration-details.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/get-workspace-configuration-details", + "title": "Título do Artigo Desconhecido", + "content": "This article explains how to display the complete configuration details for your Databricks workspace.\nThis can be useful if you want to review the configuration settings and services that are enabled in your workspace. 
For example, you can use the workspace configuration details to quickly see if Unity Catalog or Identity Federation is enabled on your workspace.\nAdditionally, the workspace configuration contains cluster configuration information for the clusters in your workspace.\nInstructions\nLogin to your Databricks workspace.\nLook at the URL displayed in your browser's address bar.\nDelete your\nworkspace ID\nfrom the workspace URL.\nAppend\n/config\nto the workspace URL, immediately after the instance name. For example,\nhttps:///config\nLoad the new URL to display the workspace configuration details.\nThe workspace configuration is displayed as plain text. It can be used to review all of the configuration settings and for the workspace.\nExample configuration display\nAWS\nAzure\nGCP\nA few example properties are listed in the table below. This is a not a comprehensive list of all configuration settings.\nProperty\nFlag\nList Account ID\naccountId\nCheck if Identity Federation is enabled\nidentityFederationEnabled\nCheck if Unity Catalog is enabled\nunityCatalogServiceEnabled\nCheck if the Azure 'Manage Account' Tab is enabled\nenableAzureManageAccountTab" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-a-404-error-when-creating-a-serverless-budget-policy-with-the-workspace-level-api.json b/scraped_kb_articles/getting-a-404-error-when-creating-a-serverless-budget-policy-with-the-workspace-level-api.json new file mode 100644 index 0000000000000000000000000000000000000000..2b9e24ed56ed73609582d1611e30c4378b1c7f2d --- /dev/null +++ b/scraped_kb_articles/getting-a-404-error-when-creating-a-serverless-budget-policy-with-the-workspace-level-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metrics/getting-a-404-error-when-creating-a-serverless-budget-policy-with-the-workspace-level-api", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to create a serverless budget policy using the workspace-level 
API, you receive a 404 error.\nERROR: Failed to create budget policy: 404\r\nEndpoint not found\nCause\nThe API endpoint used is not correct or accessible for workspace-level policy creation.\nSolution\nIf you want to create a serverless budget policy at the workspace level, use the UI. For details, refer to the “Create a serverless budget policy” section of the\nAttribute usage with serverless budget policies\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nTo continue using an API to create a serverless budget policy, do so at the account-level instead. For more information, review the\nAuthorizing access to Databricks resources\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nThen, you can use the billing admin role to access the policy in your workspace.\nAccount-level API example\nFirst, define your account console endpoint based on your cloud platform.\n(AWS)\ncloud.databricks.com\n(Azure)\nazuredatabricks.net\n(GCP)\ngcp.databricks.com\nNext, create your budget policy.\nPOST\r\nhttps:///api/2.1/accounts//budget-policies\nFor more information about the Budget policy account console API, review the\nCreate a budget policy\n(\nAWS\n|\nAzure\n|\nGCP\n) API documentation.\nThen, update your budget policy rules to provide specific permissions.\nPUT\r\nhttps:///api/2.0/preview/accounts//access-control/rule-sets\nFor more information about updating a budget policy rule, review the\nUpdate a rule set\n(\nAWS\n|\nAzure\n|\nGCP\n) API documentation.\nImplementation example in Python\nThe following code example\nCreates a budget policy.\nGets the policy ID and policy etag to perform changes in the budget policy.\nDefines users to assign permissions for the budget policy.\nUpdates policy rules.\nimport requests\r\napi_key = \"\"\r\nhost = \"\nhttps://\n\"\r\naccount_id = \"\"\r\nhead = {\r\n  \"Authorization\" : \"Bearer \" + api_key\r\n}\r\n\r\ndef create_budget_policy(name):\r\n create = requests.post(f\"{host}/api/2.1/accounts/{account_id}/budget-policies\", json={\r\n   
\"policy_name\" : name,\r\n   \"custom_tags\" : [{\r\n     \"key\" : \"tag1\", \"value\" : \"val1\"\r\n   }, {\r\n     \"key\" : \"tag2\", \"value\" : \"val2\"\r\n   }]\r\n }, headers=head)\r\n return create.json()[\"policy_id\"] #Add error handling as required\r\n\r\n\r\ndef get_policy_id_from_policy_name(name):\r\n try:\r\n  get_name = requests.get(f\"{host}/api/2.1/accounts/{account_id}/budget-policies\", headers=head)\r\n  policy_id = None\r\n  for policy in get_name.json()['policies']:\r\n   if policy['policy_name'] == name:\r\n    policy_id = policy['policy_id']\r\n  return policy_id\r\n except Exception as e:\r\n  return None\r\n\r\ndef get_etag(policy_id):\r\n etag = requests.get(f\"{host}/api/2.1/preview/accounts/{account_id}/access-control/rule-sets\", params={\r\n  \"name\" : f\"accounts/{account_id}/budgetPolicies/{policy_id}/ruleSets/default\",\r\n  \"etag\" : \"\"\r\n }, headers=head)\r\n return etag.json()['etag']\r\n\r\ndef get_req_body():\r\n #take below details as parameter in actual script, this is demo script\r\n users = [\"user1@example.com\", \"user2@example.com\"] #Will by default give user role\r\n service_principal = [\"00000-0000-0000-0000-000000\", \"111111-1111-1111-1111-111111\"] #Will give manager role. 
Please update the script if different service principal need to have different role.\r\n groups = [\"group_example_name\"]\r\n  \r\n data = []\r\n  \r\n for u in users:\r\n  data.append({\r\n   \"role\" : \"roles/budgetPolicy.user\",\r\n   \"principals\" : [\r\n    f\"users/{u}\"\r\n   ]\r\n  })\r\n for s in service_principal:\r\n  data.append({\r\n   \"role\" : \"roles/budgetPolicy.manager\",\r\n   \"principals\" : [\r\n    f\"servicePrincipals/{s}\"\r\n   ]\r\n  })\r\n   \r\n for g in groups:\r\n  data.append({\r\n   \"role\" : \"roles/budgetPolicy.manager\",\r\n   \"principals\" : [\r\n    f\"groups/{g}\"\r\n   ]\r\n  })\r\n   \r\n return data\r\n\r\n\r\ndef update_budget_rules(policy_id, etag):\r\n  \r\n print({\r\n  \"name\" : f\"accounts/{account_id}/budgetPolicies/{policy_id}/ruleSets/default\",\r\n  \"rule_set\" : {\r\n  \"name\" : f\"accounts/{account_id}/budgetPolicies/{policy_id}/ruleSets/default\",\r\n  \"description\" : \"\",\r\n  \"grant_rules\" : get_req_body(),\r\n  \"etag\" : etag\r\n }})\r\n  \r\n assign = requests.put(f\"{host}/api/2.1/preview/accounts/{account_id}/access-control/rule-sets\", json={\r\n  \"name\" : f\"accounts/{account_id}/budgetPolicies/{policy_id}/ruleSets/default\",\r\n  \"rule_set\" : {\r\n  \"name\" : f\"accounts/{account_id}/budgetPolicies/{policy_id}/ruleSets/default\",\r\n  \"description\" : \"\",\r\n  \"grant_rules\" : get_req_body(),\r\n  \"etag\" : etag\r\n }}, headers=head)\r\n return None\r\n\r\nif __name__ == \"__main__\":\r\n policy_name = \"\"\r\n policy_id = get_policy_id_from_policy_name(policy_name)\r\n policy_id = create_budget_policy(policy_name) if not policy_id else policy_id\r\n etag = get_etag(policy_id=policy_id)\r\n update_budget_rules(policy_id=policy_id, etag=etag)" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-a-base64-error-when-executing-a-udf-on-a-serverless-cluster.json b/scraped_kb_articles/getting-a-base64-error-when-executing-a-udf-on-a-serverless-cluster.json new 
file mode 100644 index 0000000000000000000000000000000000000000..17383efbc62f081a8fb3a9a2dfdac68b6455ff4d --- /dev/null +++ b/scraped_kb_articles/getting-a-base64-error-when-executing-a-udf-on-a-serverless-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/getting-a-base64-error-when-executing-a-udf-on-a-serverless-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to migrate a PySpark user-defined function (UDF) to a Unity Catalog SQL UDF and execute it on a serverless cluster, the query fails with a runtime exception.\nJob aborted due to stage failure: java.lang.IllegalArgumentException: Invalid base64 input: Check end padding and character grouping.\nYou notice the same query works correctly on all-purpose compute clusters.\nCause\nThe serverless cluster’s Photon engine has stricter base64 input validation.\nSpecifically, the\nUNBASE64\nfunction inside the UDF raises an exception when it encounters malformed base64 strings, whereas the standard Apache Spark engine on all-purpose clusters silently returns\nNULL\nfor such cases.\nSolution\nUpdate the UDF to use\nTRY_TO_BINARY(uid, 'BASE64')\ninstead of\nUNBASE64(uid)\n.\nThe\nTRY_TO_BINARY\nfunction handles malformed base64 inputs by returning\nNULL\ninstead of throwing an exception, making the UDF compatible with both all-purpose and serverless clusters.\nAdditionally, you can use the following code to identify malformed data beforehand to proactively detect and address problematic values in a dataset.\nSELECT uid FROM table WHERE uid IS NOT NULL AND TRY_TO_BINARY(uid, 'BASE64') IS NULL;" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-a-concurrentmodificationexception-in-pyspark-crossvalidator-on-databricks-runtime-154-lts.json b/scraped_kb_articles/getting-a-concurrentmodificationexception-in-pyspark-crossvalidator-on-databricks-runtime-154-lts.json new file mode 100644 index 
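The lenient behavior of `TRY_TO_BINARY(uid, 'BASE64')` described in the base64 article above can be mimicked in plain Python. This sketch is illustrative only and not from the original article; the function name is hypothetical.

```python
import base64
import binascii

def try_to_binary_base64(s):
    # Like SQL's TRY_TO_BINARY(col, 'BASE64'): return None for malformed
    # input instead of raising, whereas strict validation (as in Photon's
    # UNBASE64) would throw an exception.
    if s is None:
        return None
    try:
        return base64.b64decode(s, validate=True)
    except (binascii.Error, ValueError):
        return None

print(try_to_binary_base64("aGVsbG8="))  # well-formed input decodes to b'hello'
print(try_to_binary_base64("aGVsbG8"))   # bad padding yields None instead of an error
```

The same None-on-failure contract is what makes the `WHERE uid IS NOT NULL AND TRY_TO_BINARY(uid, 'BASE64') IS NULL` query in the article isolate exactly the malformed rows.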
0000000000000000000000000000000000000000..f1c4ab252ec3747785fa0d056c7dcdfef395ac30 --- /dev/null +++ b/scraped_kb_articles/getting-a-concurrentmodificationexception-in-pyspark-crossvalidator-on-databricks-runtime-154-lts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/getting-a-concurrentmodificationexception-in-pyspark-crossvalidator-on-databricks-runtime-154-lts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Databricks Runtime 15.4 ML and running\nCrossValidator\nfrom\npyspark.ml\nwith a large number of hyperparameter combinations, you encounter an error. The following code shows the full error stack trace.\nPy4JJavaError: An error occurred while calling o6242.evaluate.\r\n: java.util.ConcurrentModificationException\r\n\tat java.util.Hashtable$Enumerator.next(Hashtable.java:1408)\r\n\tat java.util.Hashtable.putAll(Hashtable.java:523)\r\n\tat org.apache.spark.util.Utils$.cloneProperties(Utils.scala:3474)\r\n\tat org.apache.spark.SparkContext.getCredentialResolvedProperties(SparkContext.scala:523)\r\n\tat org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:3157)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1104)\r\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)\r\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)\r\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\r\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:454)\r\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1102)\r\n\tat org.apache.spark.mllib.evaluation.AreaUnderCurve$.of(AreaUnderCurve.scala:44)\r\n\tat org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:127)\r\n\tat org.apache.spark.ml.evaluation.BinaryClassificationEvaluator.evaluate(BinaryClassificationEvaluator.scala:101)\r\n\tat sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)\r\n\tat 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\r\n\tat java.lang.reflect.Method.invoke(Method.java:498)\r\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\r\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)\r\n\tat py4j.Gateway.invoke(Gateway.java:306)\r\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\r\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\r\n\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)\r\n\tat py4j.ClientServerConnection.run(ClientServerConnection.java:119)\r\n\tat java.lang.Thread.run(Thread.java:750)\nCause\nThis error typically occurs when the\nestimatorParamMaps\nin\nCrossValidator\ncontains a high number of parameter combinations (such as 60). The following code provides an example.\nparam_grid = ParamGridBuilder() \\\r\n    .addGrid(lr_model.elasticNetParam, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]) \\\r\n    .addGrid(lr_model.regParam, [0, 0.001, 0.01, 0.1, 1.0, 10.0]) \\\r\n    .build()\r\n\r\nprint(\"Num combinations:\", len(param_grid))\r\n\r\n# Create a binary classification evaluator\r\nevaluator = BinaryClassificationEvaluator(rawPredictionCol=\"rawPrediction\",\r\n                                         labelCol=\"label\",\r\n                                         metricName=\"areaUnderROC\")\r\n\r\n# Create a cross-validator for hyperparameter tuning\r\ncv = CrossValidator(estimator=lr_model, estimatorParamMaps=param_grid,\r\n                    evaluator=evaluator, numFolds=5, parallelism=32)\nWith higher parameter combinations, you increase concurrency, which triggers the issue more often. The increased concurrency includes internal concurrent property modifications (manual clones).\nIn Databricks Runtime 15.4 LTS, you can encounter a rare race condition in\nCrossValidator\n. 
This condition leads to the\nConcurrentModificationException\nwhich happens during that manual cloning, specifically of Apache Spark properties in\norg.apache.spark.util.Utils.cloneProperties\n.\nSolution\nEnable the following Spark configuration on your cluster. This config enables the standard\nProperties.clone\ninstead of manual clone.\nspark.databricks.property.standardClone.enabled true\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAlternatively, you can upgrade to a Databricks Runtime version above 15.4 LTS." +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-an-inconsistentreadexception-error-after-updating-to-databricks-runtime-133-lts-or-above.json b/scraped_kb_articles/getting-an-inconsistentreadexception-error-after-updating-to-databricks-runtime-133-lts-or-above.json new file mode 100644 index 0000000000000000000000000000000000000000..80ee2a0e13214d68da46e88f5f47a2fe003bd628 --- /dev/null +++ b/scraped_kb_articles/getting-an-inconsistentreadexception-error-after-updating-to-databricks-runtime-133-lts-or-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/getting-an-inconsistentreadexception-error-after-updating-to-databricks-runtime-133-lts-or-above", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou updated Databricks Runtime from a version below 13.3 LTS, and the following issue now appears in the logs.\nCaused by: com.databricks.common.filesystem.InconsistentReadException: The file might have been updated during query execution. 
Ensure that no pipeline updates existing files during query execution and try again.\nCause\nThere is a consistency check introduced in Databricks Runtime 13.3 LTS and above.\nIn earlier versions such as Databricks Runtime 10.4 LTS, Databricks Runtime would read a file between query planning and execution even if it was updated, which could lead to unpredictable results. Databricks Runtime 13.3 LTS now returns an error if a file is updated during these stages to prevent inconsistencies.\nSolution\nApply the following configurations to disable file status caching. Disabling file status caching minimizes inconsistencies by reducing the duration files are kept in the cache.\nSet\ndatabricks.loki.fileStatusCache.enabled\nto\nfalse\n.\nSet\nspark.hadoop.databricks.loki.fileStatusCache.enabled\nto\nfalse\n.\nNote\nReducing the time files are kept in cache reduces the time between file status checks. It does not guarantee that the issue will be resolved during read and write operations.\nIf the issue persists, please check if another application is updating the file while you are trying to read it." 
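In a cluster's Spark configuration box, the two flags from the solution above would be entered as key-value pairs, one per line, in the usual Spark config format:

```
databricks.loki.fileStatusCache.enabled false
spark.hadoop.databricks.loki.fileStatusCache.enabled false
```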
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-assertionerror-when-using-the-withcolumns-function-in-databricks-runtime-14-3-lts-or-above.json b/scraped_kb_articles/getting-assertionerror-when-using-the-withcolumns-function-in-databricks-runtime-14-3-lts-or-above.json new file mode 100644 index 0000000000000000000000000000000000000000..62cf63a23b52e360339b43070ce76c450470a077 --- /dev/null +++ b/scraped_kb_articles/getting-assertionerror-when-using-the-withcolumns-function-in-databricks-runtime-14-3-lts-or-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/getting-assertionerror-when-using-the-withcolumns-function-in-databricks-runtime-14-3-lts-or-above", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn Databricks Runtime 14.3 LTS, you use the\nwithColumns()\nfunction on an Apache Spark DataFrame to create new columns by referencing existing columns with the same names. The following example code uses id, name, category.\ndf1 = df.withColumns(\r\n    [(\"id\", \"id\"), (\"name\", \"name\"), (\"category\", \"category\")]\r\n)\nYou then encounter the following error.\nPy4JJavaError: An error occurred while calling o395.sql.\r\n: java.util.NoSuchElementException: key not found: LocationType#214517\nCause\nIn Databricks Runtime 14.3 LTS or above, the\nwithColumns()\nfunction requires column names to be wrapped with the\ncol()\nwrapper.\nSolution\nWrap column names with the\ncol()\nwrapper when using the\nwithColumns()\nfunction.\nfrom pyspark.sql.functions import col\r\n\r\ndf1 = df.withColumns(\r\n    [(col(c), col(c)) for c in [\"id\", \"name\", \"category\"]]\r\n)\nFor more information, refer to the\npyspark.sql.functions.col\nAPI documentation.\nBest practices\nRegularly review Databricks Runtime release notes for changes that might affect your code.\nWhen upgrading to a new Databricks Runtime version, test your existing code to identify and address any compatibility issues. 
Refer to the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for the corresponding upgrade version to assist." +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-cannot-operate-on-a-handle-that-is-closed-error-when-running-an-apache-spark-job.json b/scraped_kb_articles/getting-cannot-operate-on-a-handle-that-is-closed-error-when-running-an-apache-spark-job.json new file mode 100644 index 0000000000000000000000000000000000000000..89138deacf4c4c5b089bb698730a89342107966a --- /dev/null +++ b/scraped_kb_articles/getting-cannot-operate-on-a-handle-that-is-closed-error-when-running-an-apache-spark-job.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/getting-cannot-operate-on-a-handle-that-is-closed-error-when-running-an-apache-spark-job", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile running an Apache Spark job in either a workflow or a notebook, you receive a\n\"Cannot operate on a handle that is closed\"\nerror.\nWhen you check the stack trace, you see the following output.\norg.apache.spark.sql.execution.streaming.sources.ForeachBatchUserFuncException: [FOREACH_BATCH_USER_FUNCTION_ERROR] An error occurred in the user provided function in foreach batch sink. Reason: An exception was raised by the Python Proxy\r\n  py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.expr. 
Trace:\r\n  org.apache.spark.SparkException: Cannot operate on a handle that is closed.\r\n  at com.databricks.unity.HandleImpl.assertValid(UCSHandle.scala:98)\r\n  at com.databricks.unity.HandleImpl.setupThreadLocals(UCSHandle.scala:116)\r\n  at com.databricks.backend.daemon.driver.SparkThreadLocalUtils$$anon$1.run(SparkThreadLocalUtils.scala:48)\r\n  at java.lang.Iterable.forEach(Iterable.java:75)\r\n  at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:198)\r\n  at py4j.ClientServerConnection.run(ClientServerConnection.java:119)\r\n  at java.lang.Thread.run(Thread.java:750)\nCause\nYou’re using multithreading to access Unity Catalog (UC) objects in the\nForEachBatch\nof a streaming job, especially using Python’s ThreadPoolExecutor library.\nWhen the UC credentials are not available to the threads created by ThreadPoolExecutor, the credentials are not passed on to the thread pool threads as they should be, leading to the\n\"Cannot operate on a handle that is closed”\nerror.\nSolution\nDatabricks does not recommend using ThreadPoolExecutor in Python and ForEachBatch together when accessing UC objects. Instead, consider the following options:\nIf you’re using multi-threading for fan-out operations, write the operations using multiple streams instead.\nMove to Scala and use supported thread pool types. You can use the special thread pools in\n`org.apache.spark.util.ThreadUtils`\n, such as\n`org.apache.spark.util.ThreadUtils.newDaemonFixedThreadPool`\n.\nFor more information about Scala thread pools in Unity Catalog, refer to the “Limitations” section of the\nWhat is Unity Catalog?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
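The "multiple streams instead of fan-out" suggestion above can be sketched as pseudocode. This is illustrative only: the table names, checkpoint paths, and loop are made up, and a running Spark session is assumed.

```
# Instead of one stream whose foreachBatch fans out work over a
# ThreadPoolExecutor, start one independent stream per sink; each stream
# gets its own checkpoint, and Unity Catalog credentials stay on
# Spark-managed threads.
for target in ["catalog.schema.sink_a", "catalog.schema.sink_b"]:   # hypothetical names
    (spark.readStream.table("catalog.schema.source")                # hypothetical source
        .writeStream
        .option("checkpointLocation", f"/checkpoints/{target}")
        .toTable(target))
```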
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-dashboard-download-successful-message-but-not-downloading.json b/scraped_kb_articles/getting-dashboard-download-successful-message-but-not-downloading.json new file mode 100644 index 0000000000000000000000000000000000000000..a80a2a529f4f9ad73fb8393a53ba24fff776aee1 --- /dev/null +++ b/scraped_kb_articles/getting-dashboard-download-successful-message-but-not-downloading.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/getting-dashboard-download-successful-message-but-not-downloading", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you export a Databricks SQL Analytics dashboard, you see a message that the export was successful, but the dashboard doesn’t seem to have downloaded.\nCause\nYour organization may have a browser policy restricting downloads.\nSolution\nFirst, choose your respective browser from the following list and following the steps to navigate to the browser policy. Current version refers to the most up to date version as of July 2025.\nGoogle Chrome (current version)\nOpen Chrome.\nGo to:\nchrome://policy\nIn the list of active policies, look for download-related keys.\nMozilla Firefox (current version)\nOpen Firefox.\nGo to:\nabout:policies\nRefer to the\nActive\ntab for relevant download policies.\nMicrosoft Edge (current version – Chromium-based)\nOpen Edge.\nGo to:\nedge://policy\nLook for download-related policies.\nApple Safari (current version – macOS only)\nSafari doesn’t have an internal policy viewer, but download behavior can be restricted by system-managed preferences (mobile device management (MDM) profiles or Terminal settings).\nOpen Terminal.\nRun:\ndefaults read com.apple.Safari\nor\n/usr/bin/profiles show -type configuration\nLook for download keys.\nThen, verify whether you have browser policies in place that restrict downloads. 
Consider adjusting them to allow downloads for Databricks or specifically for the domain used by Databricks in your organization.\nLast, reach out to your organization's security or IT team to understand the current browser policies enforced within your organization. Discuss the requirement to export dashboards from Databricks and see if the policies can be adjusted to accommodate this functionality." +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-error-comunivocityparserscommontextparsingexception-when-parsing-data.json b/scraped_kb_articles/getting-error-comunivocityparserscommontextparsingexception-when-parsing-data.json new file mode 100644 index 0000000000000000000000000000000000000000..6d0c6d9df304c7d8f704242fb3e5fd7a451f331f --- /dev/null +++ b/scraped_kb_articles/getting-error-comunivocityparserscommontextparsingexception-when-parsing-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/getting-error-comunivocityparserscommontextparsingexception-when-parsing-data", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou use the following code to parse data using a notebook or Auto Loader.\ndf = (spark. readStream\r\nformat (\"cloudFiles\")\r\n.option (\"cloudFiles. format\", \"csv\" )\r\n.option(\"useStrictGlobber\", \"true\")\r\n.option (\"header\", \"true\")\r\n.option (\"sep\", \";\")\r\n.option (\"cloudFiles.schemaLocation\"\r\n.schema_location)\r\n.load (source_path) )\nYou then receive the following error.\nPy4JJavaError: An error occurred while calling o693.load.\r\n\r\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.139.64.10 executor driver): com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480\r\n\r\nHint: Number of columns processed may have exceeded limit of 20480 columns. 
Use settings.setMaxColumns(int) to define the maximum number of columns your input can have\r\n\r\nEnsure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse\nCause\nUnivocity parser, a Java library used by Apache Spark internally to parse CSV/text files, is causing the error. When univocity parser cannot properly parse text data, it throws a\nTextParsingException\nruntime error.\nThe failure to parse text data arises when a row is malformed.\nSolution\nFirst, verify that the delimiter used in your read operation matches the format of your input files.\nIn a notebook, run the following code to ensure your read configuration is accurate. This code uses a semicolon delimiter. If you’re using a comma, you can set\n“,”\nas the second parameter.\ndf = spark.read.option(\"delimiter\", \";\").csv(\"\")\nIf you’re unsure about which delimiter is in use, open the file directly using Databricks File System (DBFS) or use the\nData\ntab in the Databricks UI to preview it.\nThen make the necessary corrections to the data." 
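If you are unsure which delimiter a file uses, Python's standard `csv` module can guess it from a small sample before you set the `sep`/`delimiter` option in the Spark read. This is an illustrative sketch outside the original article, using made-up sample data.

```python
import csv

sample = "id;name;salary\n1;James;1000\n2;Maria;4000\n"

# csv.Sniffer inspects a sample of the text and picks the most likely
# delimiter from the candidates supplied.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
print(dialect.delimiter)
```

In practice you would read the first few lines of the file from storage into `sample` and then pass the detected delimiter to the read options.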
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-error-not_implemented-tojson-is-not-implemented-when-trying-to-convert-a-dataframe-to-a-json-string.json b/scraped_kb_articles/getting-error-not_implemented-tojson-is-not-implemented-when-trying-to-convert-a-dataframe-to-a-json-string.json new file mode 100644 index 0000000000000000000000000000000000000000..845abb2641bca5a2431cdc26495b5a1679a67e46 --- /dev/null +++ b/scraped_kb_articles/getting-error-not_implemented-tojson-is-not-implemented-when-trying-to-convert-a-dataframe-to-a-json-string.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/getting-error-not_implemented-tojson-is-not-implemented-when-trying-to-convert-a-dataframe-to-a-json-string", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou use a Unity Catalog (UC) standard (formerly shared) access mode cluster to execute a command similar to the following example.\ndf = spark.createDataFrame([(2, \"Alice\"), (5, \"Bob\")], schema=[\"age\", \"name\"])\r\ndf.toJSON().first()\nUpon execution, you get the error\n[NOT_IMPLEMENTED] toJSON() is not implemented\n. The following screenshot shows the error in the notebook UI.\nCause\ntoJSON()\nis not implemented in UC standard mode clusters for security reasons.\nSpecifically,\ntoJSON()\nconverts a DataFrame into an RDD of a string, and RDD APIs are not supported in UC standard mode clusters.\nSolution\nUse a dedicated (formerly single-user) access mode cluster instead.\nAlternatively, you can use\nto_json\n, which returns a JSON string with the STRUCT or VARIANT specified in the expression.\nFor more information, refer to the\nto_json\nfunction\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
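As a plain-Python illustration (not from the original article) of the shape of the output: `to_json` applied to a struct of all columns yields one compact JSON document per row, the same shape `json.dumps` produces for a dict, so `df.select(to_json(struct("*")))` is a drop-in replacement for the per-row strings `toJSON()` used to return.

```python
import json

# The article's example DataFrame has rows (2, "Alice") and (5, "Bob")
# with columns age and name; each row serializes to one compact JSON
# document, mirroring what to_json(struct("*")) returns per row.
rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
json_lines = [json.dumps(r, separators=(",", ":")) for r in rows]
print(json_lines[0])  # {"age":2,"name":"Alice"}
```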
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-error-when-trying-to-connect-to-sftp-server-from-databricks-using-passwordless-authentication.json b/scraped_kb_articles/getting-error-when-trying-to-connect-to-sftp-server-from-databricks-using-passwordless-authentication.json new file mode 100644 index 0000000000000000000000000000000000000000..43de4eb49bef41166cf019d55b87e3ce7641aa9c --- /dev/null +++ b/scraped_kb_articles/getting-error-when-trying-to-connect-to-sftp-server-from-databricks-using-passwordless-authentication.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/getting-error-when-trying-to-connect-to-sftp-server-from-databricks-using-passwordless-authentication", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to connect to the SFTP server from Databricks using passwordless authentication, you receive the following error.\nError Message : Host Key Verification Failed\nCause\nCluster restarts delete the data stored in the local disk. The private key is not preserved in the erasure, resulting in host key verification failures.\nSolution\nCreate an RSA authentication key to access a remote site from your Databricks account and preserve the private key.\nGenerate an SSH key-pair\nFirst, create an SSH key pair inside or outside Databricks accordingly. To create the RSA key pair in Databricks, run the following command in a Databricks notebook.\n%sh\r\nssh-keygen -t rsa -N \"\" -f ~/.ssh/id_rsa\nThis command creates an RSA key pair without a passphrase at the following location.\nPrivate key:\n~/.ssh/id_rsa\nPublic key:\n~/.ssh/id_rsa.pub\nPreserve the public and private keys\nCopy or upload the generated public and private keys to a secure location, such as workspace files, cloud storage, or volume.\nCreate a cluster init script and attach it to automate the restoration of the SSH keys during cluster startup. 
This init script can be used to copy the SSH keys from your secure location to the appropriate location on the cluster.\n#!/bin/bash\r\nsleep 5\r\n#nodes don’t have .ssh by default\r\nmkdir -p /root/.ssh/\r\n#copy the private key to .ssh\r\ncp /root/.ssh/id_rsa\r\n#modify the permissions of the private key file\r\nchmod 400 /root/.ssh/id_rsa\nFor more information on creating an init script, refer to the\nWhat are init scripts?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCopy the public key to a remote server\nCopy the public key (\nid_rsa.pub\n) to the remote server's\n~/.ssh/authorized_keys\nfile. Ensure that the permissions of the\n~/.ssh\nfolder and the\nauthorized_keys\nfile on the remote server are set correctly to avoid access issues.\nTest the connection from a Databricks notebook\n%sh\r\nssh user@remote_host\nOr\n%sh\r\nsftp user@remote_host" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline.json b/scraped_kb_articles/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline.json new file mode 100644 index 0000000000000000000000000000000000000000..9917538d4ef624a535d9ca1d412b7e465083bafd --- /dev/null +++ b/scraped_kb_articles/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/getting-node-specific-instead-of-cluster-wide-memory-usage-data-from-system-compute-node_timeline", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to programmatically get a cluster’s memory usage using\nsystem.compute.node_timeline\n, you get node-specific data instead of cluster-wide data.\nCause\nDetermining cluster-wide memory usage is not possible with a single system table.\nSolution\nTo check memory usage (in bytes), join the table\nnode_timeline\nwith the table\nnode_types\n. 
Run the following code in a notebook, through a job, or with Databricks SQL.\nselect cluster_id, instance_id, start_time, end_time, round(mem_used_percent / 100 * node_types.memory_mb, 0) as mem_used_mb\r\nfrom system.compute.node_timeline\r\njoin system.compute.node_types using(node_type)\r\norder by start_time desc;\nFor more information, refer to the\nCompute system tables reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-nomodulefound-or-attribute-error-when-using-the-flash-attention-model-in-mlflow.json b/scraped_kb_articles/getting-nomodulefound-or-attribute-error-when-using-the-flash-attention-model-in-mlflow.json new file mode 100644 index 0000000000000000000000000000000000000000..0d214103e31e07e1d6649f8c30e199a384cdadc9 --- /dev/null +++ b/scraped_kb_articles/getting-nomodulefound-or-attribute-error-when-using-the-flash-attention-model-in-mlflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/getting-nomodulefound-or-attribute-error-when-using-the-flash-attention-model-in-mlflow", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the Flash Attention model from the Hugging Face library, you receive attribute errors or\nNoModuleFound\nerrors such as the following.\n“AttributeError: 'CLIPTextTransformer' object has no attribute '_use_flash_attention_2'.”\nCause\nFlash-attention\nmodels are not supported with the\npyfunc\nflavor in MLflow because they are incompatible with PyTorch and CUDA versions that require a custom version.\nMLflow’s Model Serving infrastructure also does not accommodate using\nflash-attention\nin\npyfunc\nmodels.\nSolution\nWhen logging a model that requires\nflash-attention\nuse\nmlflow.transformers.log_model\nwith a custom wheel version of\nflash-attn\n.\nSpecify all pip requirements as a list and pass the list as a parameter into the\nmlflow.transformers.log_model\nfunction call.\nMake sure to 
indicate the versions of\npytorch\n,\ntorch\n, and\ntorchvision\nwhich are compatible with the CUDA version you specify in your\nflash-attention\nwheel.\nDatabricks recommends using the following versions and wheels.\nPytorch (\nindex page with file download links\n)\nTorch 2.0.1+cu118\nTorchvision 0.15.2+cu118\nFlash-attention (\nGithub\n.whl\ndownload\n)\nExample\nmlflow.transformers.log_model(\r\n f\"{model_name}\",\r\n registered_model_name=f\"{model_name}\",\r\n extra_pip_requirements=[\r\n \"git+https://github.com/huggingface/diffusers@v0.22.1\",\r\n \"peft==0.12.0\",\r\n \"compel==2.0.3\",\r\n \"boto3==1.34.39\",\r\n \"transformers==4.39.2\",\r\n \"--extra-index-url https://download.pytorch.org/whl/cu118\",\r\n \"torch==2.0.1+cu118\",\r\n \"torchvision==0.15.2+cu118\",\r\n \"https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.0cxx11abiFALSE-cp311-cp311-linux_x86_64.whl\"\r\n ]\r\n)\nAdditional context\nWhen using custom GPU serving, Databricks Model Serving will first resolve the version of PyTorch or Tensorflow that your model uses, then install a compatible CUDA version for the PyTorch or Tensorflow version it detects.\nNote\nFor PyTorch, Databricks relies on the\ntorch pip\npackage to determine the compatible CUDA version, and for Tensorflow, Databricks determines a compatible version based on the GPU section of Tensorflow’s\nBuild from source\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-nullpointerexception-when-using-dbutilssecretsget-in-jar-jobs.json b/scraped_kb_articles/getting-nullpointerexception-when-using-dbutilssecretsget-in-jar-jobs.json new file mode 100644 index 0000000000000000000000000000000000000000..7f1bc1415bdbbb93e507c7b0217de7f13e54fe6f --- /dev/null +++ b/scraped_kb_articles/getting-nullpointerexception-when-using-dbutilssecretsget-in-jar-jobs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/getting-nullpointerexception-when-using-dbutilssecretsget-in-jar-jobs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the\ndbutils.secrets.get(scopeName, secretKey)\nmethod in a jar job, you encounter a\nNullPointerException\n.\nThis issue arises even though the same code works when executed in a notebook.\nCause\nThe\ndbutils\nlibrary is designed for use within notebooks and requires additional configuration to be accessible in jar jobs.\nSolution\nEnsure that the necessary dependencies for\ndbutils\nare included in your Java project. Reference the\nDatabricks SDK for Java\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to include the required libraries.\nAdd the following dependency to your project's build file (such as\npom.xml\nfor Maven).\n<dependency>\r\n  <groupId>com.databricks</groupId>\r\n  <artifactId>dbutils-api</artifactId>\r\n  <version>1.0.0</version>\r\n</dependency>\nRebuild your jar file with the updated dependencies and deploy it to your Databricks environment.\nRerun your jar job." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-remote-repo-not-found-error-while-trying-to-create-or-sync-git-repo.json b/scraped_kb_articles/getting-remote-repo-not-found-error-while-trying-to-create-or-sync-git-repo.json new file mode 100644 index 0000000000000000000000000000000000000000..c44c0928f7af72de0ba9cf6d5a670a3f7d763d06 --- /dev/null +++ b/scraped_kb_articles/getting-remote-repo-not-found-error-while-trying-to-create-or-sync-git-repo.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/getting-remote-repo-not-found-error-while-trying-to-create-or-sync-git-repo", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-request_limit_exceeded-when-using-catalog-information-schema.json b/scraped_kb_articles/getting-request_limit_exceeded-when-using-catalog-information-schema.json new file mode 100644 index 0000000000000000000000000000000000000000..cf886a8bfa412ed4f30110b737ef3a33a0d4be44 --- /dev/null +++ b/scraped_kb_articles/getting-request_limit_exceeded-when-using-catalog-information-schema.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/getting-request_limit_exceeded-when-using-catalog-information-schema", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you make frequent calls to your Unity Catalog schema, you encounter a request rate limit error.\n“[RequestId=ErrorClass=REQUEST_LIMIT_EXCEEDED.REQUEST_LIMIT_EXCEEDED]\r\nYour request was rejected since your organization has exceeded the rate limit. Please retry your request later.”\nCause\nYour organization has exceeded its allocated rate limit. This limit is in place to maintain Databricks service stability. 
Frequent calls to information schema queries can overload the service and trigger the error.\nIn some cases, the error may be caused by concurrent queries that are executed too frequently, leading to a large number of Remote Procedure Calls (RPCs) that hit the rate limit. Every hour, hundreds of queries against the\n`columns`\ntable may occur, which can result in a larger number of RPCs and trigger the rate limit.\nSolution\nThere are three actions you can take to reduce the likelihood of receiving a rate limit error.\nReduce the concurrency of information schema queries by making fewer calls to the API. These queries scan the Unity Catalog service database directly, so the frequency should be low.\nAdd supported selective filters to the query. Databricks supports push filters like\n`column_name (like/=/>/=/<=) `\n. Using selective filters can reduce the scanned data, making the query faster and reducing overhead. The column name can be\n``\n,\n``\n, or\n``\n.\nAvoid issuing information schema queries too frequently or you will hit a rate limit. Treat these queries like any other calls the REST API sends to the Databricks service.\nFor more information, review the\nInformation schema\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-site-cant-be-reached-error-when-trying-to-access-workspace.json b/scraped_kb_articles/getting-site-cant-be-reached-error-when-trying-to-access-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..febef4ae02403b30c7f4f552fb9f2e6571820d5b --- /dev/null +++ b/scraped_kb_articles/getting-site-cant-be-reached-error-when-trying-to-access-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/getting-site-cant-be-reached-error-when-trying-to-access-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to access your workspace, you receive a server error saying the site can’t be reached. The following image shows the error from the browser UI.\nCause\nThe workspace is using a custom DNS that does not match the hostname IP address.\nSolution\nFirst, obtain the custom DNS CIDR from your cloud console.\nNext, from the terminal, run\nnslookup\non your hostname.\n%sh\r\nnslookup  \nThen ensure the hostname is resolving to the IP address matching the DNS server records you checked in the previous step.\nEnsure your IP and the configured IP from the custom DNS server match. Otherwise, point your host to the proper DNS classless inter-domain routing (CIDR)." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-this-feature-is-not-supported-error-when-using-dropduplicates-operation-on-variant-type-fields.json b/scraped_kb_articles/getting-this-feature-is-not-supported-error-when-using-dropduplicates-operation-on-variant-type-fields.json new file mode 100644 index 0000000000000000000000000000000000000000..3f2960665adb1074440cdd68ef57e4863a6f000f --- /dev/null +++ b/scraped_kb_articles/getting-this-feature-is-not-supported-error-when-using-dropduplicates-operation-on-variant-type-fields.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/getting-this-feature-is-not-supported-error-when-using-dropduplicates-operation-on-variant-type-fields", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn Spark SQL, you use the operation\ndropDuplicates()\nwith variant type fields. The following code shows a usage example.\n# adding a variant type field to a dataframe\r\ndf = df.withColumn(\"\", parse_json(col('')))\r\n\r\n# dropping duplicates\r\ndf = df.dropDuplicates()\nYou then receive the following error.\nThe feature is not supported: Cannot have VARIANT type columns in DataFrame which calls set operations (INTERSECT, EXCEPT, etc.), but the type of column `` is \"VARIANT\".\nYou may also see this kind of\nMatchError\nin your logs.\n25/03/03 00:31:55 ERROR Executor: Exception in task 2.0 in stage 51.0 (TID 201)\r\nscala.MatchError: {\"x\":11,\"y\":22} (of class org.apache.spark.unsafe.types.VariantVal)\r\n  at org.apache.spark.sql.catalyst.expressions.InterpretedHashFunction.hash(hash.scala:559)\r\n  at org.apache.spark.sql.catalyst.expressions.Murmur3Hash.computeHash(hash.scala:658)\r\n  at org.apache.spark.sql.catalyst.expressions.Murmur3Hash.computeHash(hash.scala:648)\r\n  at org.apache.spark.sql.catalyst.expressions.HashExpression.eval(hash.scala:316)\r\n  at org.apache.spark.sql.catalyst.expressions.Pmod.eval(arithmetic.scala:1117)\nCause\nComparison operations 
like\ndropDuplicates()\n– or\nINTERSECT\nand\nEXCEPT\n– are not supported on variant type fields.\nSolution\nThere are three options.\n1. Execute the\ndropDuplicates()\noperation before converting to a variant type. (As an added benefit, removing duplicates before parsing may also reduce runtime.)\ndf = df.dropDuplicates()\r\ndf = df.withColumn(\"\", parse_json(\r\ncol(‘’)))\n2. Instead of using\nVariantType\nin a column, use a data type that is compatible with\ndropDuplicates()\n, such as Apache Spark's built-in functions to parse the semi-structured data into a\nStructType\nor\nMapType\n.​ The following code provides an example using\nMapType\n.\n#creating schema\r\njson_map_schema = MapType(StringType(), StringType())\r\n\r\n#adding the MapType field\r\ndf = df.withColumn(\"\", from_json(col(\"\"), json_map_schema)) \r\ndf = df.dropDuplicates()\n3. Implement a custom deduplication logic using window functions. For example, you can assign row numbers to partitions of data based on certain columns, and then filter out duplicates by selecting the first occurrence in each partition.\ndf = df.withColumn(\"\", parse_json(\r\ncol(‘’)))\r\n\r\n# using window function to drop duplicates\r\ndf = df.withColumn(\"row_num\", row_number().over(window_spec))\r\ndf = df.filter(col(\"row_num\") == 1).drop(\"row_num\")\nFor more information, review the\nVARIANT\ntype\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-timeout-error-during-maven-library-installation-on-databricks-cluster.json b/scraped_kb_articles/getting-timeout-error-during-maven-library-installation-on-databricks-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..34b0e818d107ad74db8442b046f8773c4d17196a --- /dev/null +++ b/scraped_kb_articles/getting-timeout-error-during-maven-library-installation-on-databricks-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/getting-timeout-error-during-maven-library-installation-on-databricks-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to install Maven libraries in your Databricks cluster, you experience long installation times and then timeout errors. The error message appears as either of the following.\nLibrary installation timed out after 1800 seconds.\nOr\nfailed: Network is unreachable (connect failed)\nCause\nYour workspace has restrictions on public network access, which is blocking the package installer from reaching the public Maven repository.\nSolution\nConfigure your cluster to use a private repo as the default repository and disable the default Maven Central resolver. Changing these settings forces the package installer to use the on-premise artifactory as the default repository, bypassing the need to reach the public Maven repo.\nIn your cluster settings, navigate to the\nAdvanced options > Spark\ntab. In the\nSpark config\nfield, add the following settings.\n1. Configure the default repository to your private repo.\nspark.databricks.driver.preferredMavenCentralMirrorUrl \n2. Disable the default Maven Central resolver.\nspark.databricks.driver.disableDefaultMavenCentralResolver true\n3. 
Disable the Apache Spark packages resolver.\nspark.databricks.driver.disableSparkPackagesResolver true" +} \ No newline at end of file diff --git a/scraped_kb_articles/getting-valueerror-ndarray-is-not-supported-by-dataframe_to_mds-when-converting-an-apache-spark-dataframe-to-mds-format-using-mosaic-streaming.json b/scraped_kb_articles/getting-valueerror-ndarray-is-not-supported-by-dataframe_to_mds-when-converting-an-apache-spark-dataframe-to-mds-format-using-mosaic-streaming.json new file mode 100644 index 0000000000000000000000000000000000000000..10791a75c1b5e5a7cc862e96f62ead93b31ac81a --- /dev/null +++ b/scraped_kb_articles/getting-valueerror-ndarray-is-not-supported-by-dataframe_to_mds-when-converting-an-apache-spark-dataframe-to-mds-format-using-mosaic-streaming.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/getting-valueerror-ndarray-is-not-supported-by-dataframe_to_mds-when-converting-an-apache-spark-dataframe-to-mds-format-using-mosaic-streaming", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to convert an Apache Spark DataFrame that contains a 1D array column to MDS format using Mosaic Streaming. 
In your code, you are calling the\ndataframe_to_mds\nfunction, specifying the\nmds_kwargs\nparameter containing the columns field with your array column type as\nndarray\n.\nExample code\nfrom streaming.base.converters import dataframe_to_mds\r\n\r\nmds_kwargs = {'out': \"\", 'columns': {'id_col':'float64', 'array_data': 'ndarray'}}\r\ndataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)\nWhen you run this code, you get a value error.\nValueError: ndarray is not supported by dataframe_to_mds\nCause\nWhen specifying an array column as\nndarray\ntype in the columns field of the\nmds_kwargs\nparameter, it is necessary to append the data type of the elements of the array, which can be one of the following:\nArrayType(ShortType()): 'ndarray:int16'\nArrayType(IntegerType()): 'ndarray:int32'\nArrayType(LongType()): 'ndarray:int64'\nArrayType(FloatType()): 'ndarray:float32'\nArrayType(DoubleType()): 'ndarray:float64'\nSolution\nProperly pass the data type of the elements of the array column in the\nmds_kwargs\n. In this example code,\nfloat64\nis specified as the data type. This resolves the issue.\nExample code\nfrom streaming.base.converters import dataframe_to_mds\r\n\r\nmds_kwargs = {'out': \"\", 'columns': {'id_col':'float64', 'array_data': 'ndarray:float64'}}\r\ndataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)\nNote\nEnsure you are using the\nmosaicml-streaming\npackage version 0.7.6 or above. Support for\nndarray\nwas added in version 0.7.6." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/getting-valueerror-when-trying-to-import-pmml-files-using-pypmml.json b/scraped_kb_articles/getting-valueerror-when-trying-to-import-pmml-files-using-pypmml.json new file mode 100644 index 0000000000000000000000000000000000000000..1f7940527ceb806c0ba0fe5b0a63476b3a812fad --- /dev/null +++ b/scraped_kb_articles/getting-valueerror-when-trying-to-import-pmml-files-using-pypmml.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/getting-valueerror-when-trying-to-import-pmml-files-using-pypmml", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou try to import\nPMML\nfiles using PyPMML. The following code shows an example command.\n%python\r\nfrom pypmml import Model\r\nmodelb = Model.fromFile('/dbfs/username/DecisionTreeIris.pmml')\nYou then receive the following error.\nValueError: invalid literal for int() with base 10: b'[Global flags]\\n'.\nCause\nPyPMML is designed to retrieve and return the port number on which the Py4J gateway is running when it initiates the gateway.\nHowever, included in the Java options there is a\n-XX:+PrintFlagsFinal\nflag which modifies the gateway's output. This flag produces a list of all global Java flags instead of just the port number. This extended output is incorrectly passed to the\nint()\nfunction, which expects a numerical value. As a result the\nint()\nfunction returns a parsing error.\nSolution\nRemove the flag from the Java options before calling the PyPMML method to remove the excess global list from the output. 
Execute the following code in a notebook.\nimport os\r\n\r\ntmpval = os.environ.get(\"JAVA_OPTS\", \"\")\r\ntmpval = tmpval.replace('-XX:+PrintFlagsFinal', '')\r\nos.environ[\"JAVA_OPTS\"] = tmpval" +} \ No newline at end of file diff --git a/scraped_kb_articles/git-integrated-workloads-fail-in-databricks-with-permission_denied-invalid-git-provider-credentials-error.json b/scraped_kb_articles/git-integrated-workloads-fail-in-databricks-with-permission_denied-invalid-git-provider-credentials-error.json new file mode 100644 index 0000000000000000000000000000000000000000..7dfb396f634c84441f3ca62cd884deccb1ab610b --- /dev/null +++ b/scraped_kb_articles/git-integrated-workloads-fail-in-databricks-with-permission_denied-invalid-git-provider-credentials-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/git-integrated-workloads-fail-in-databricks-with-permission_denied-invalid-git-provider-credentials-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Git integrated with your Databricks environment, your workloads fail with an error message.\nFailed to checkout Git repository: PERMISSION_DENIED: Invalid Git provider credentials. Go to User Settings > Git Integration to ensure that:\r\nYou have entered a username with your Git provider credentials.\r\nYou have selected the correct Git provider with your credentials.\r\nYour personal access token or app password has the correct repo access.\r\nYour personal access token has not expired.\r\nIf you have SSO enabled with your Git provider, be sure to authorize your token.\nCause\nYour Git provider credentials are invalid or misconfigured.\nSolution\nVerify that the Git provider username and password entered are correct.\nEnsure that the correct Git provider is selected in the User Settings > Git Integration section.\nCheck that the personal access token or app password has the necessary repository access permissions.\nConfirm that the personal access token has not expired. 
If it has, generate a new token and update the credentials.\nIf SSO is enabled with the Git provider, make sure to authorize the token appropriately.\nFor further reading and detailed instructions, refer to the\nConfigure Git credentials & connect a remote repo to Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nPreventative measures\nRegularly update and review Git provider credentials to ensure they are current and have the correct permissions.\nMonitor job runs and logs to quickly identify and address any credential-related issues.\nDocument and follow best practices for managing Git integrations and credentials in Databricks environments." +} \ No newline at end of file diff --git a/scraped_kb_articles/global-temp-view-not-found.json b/scraped_kb_articles/global-temp-view-not-found.json new file mode 100644 index 0000000000000000000000000000000000000000..abfac3e180fa0b525335ea2c89866c0cadb18be3 --- /dev/null +++ b/scraped_kb_articles/global-temp-view-not-found.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/global-temp-view-not-found", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to query a table or view, you get this error:\nAnalysisException:Table or view not found when trying to query a global temp view\nCause\nYou typically create global\ntemp\nviews so they can be accessed from different sessions and kept alive until the application ends. You can create a global\ntemp\nview with the following statement:\n%scala\r\n\r\ndf.createOrReplaceGlobalTempView(\"\")\nHere,\ndf\nis the\nDataFrame\n. Another way to create the view is with:\n%sql\r\n\r\nCREATE GLOBAL TEMP VIEW \nAll global temporary views are tied to a system temporary database named\nglobal_temp\n. 
If you query the global table or view without explicitly mentioning the\nglobal_temp\ndatabase, then the error occurs.\nSolution\nAlways use the qualified table name with the\nglobal_temp\ndatabase, so that you can query the global view data successfully.\nFor example:\n%sql\r\n\r\nselect * from global_temp.;" +} \ No newline at end of file diff --git a/scraped_kb_articles/google-ai-studio-key-fails-with-mosaic-ai-model-serving-through-vertex-ai-provider.json b/scraped_kb_articles/google-ai-studio-key-fails-with-mosaic-ai-model-serving-through-vertex-ai-provider.json new file mode 100644 index 0000000000000000000000000000000000000000..fa4112d5eeceb45322d9e0e746bbb879277190fc --- /dev/null +++ b/scraped_kb_articles/google-ai-studio-key-fails-with-mosaic-ai-model-serving-through-vertex-ai-provider.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/google-ai-studio-key-fails-with-mosaic-ai-model-serving-through-vertex-ai-provider", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re attempting to serve Gemini models through Mosaic AI Model Serving using the built-in Google Cloud Vertex AI provider. 
You use a Google AI Studio-generated API key in your configuration, and authentication fails with a generic internal error.\n{\"error_code\": \"INTERNAL_ERROR\", \"message\": \"Internal Error\"}\nCause\nYou’re trying to use the Google Cloud Vertex AI provider with a Google AI Studio API key.\nThe Google Cloud Vertex AI provider in Mosaic AI Model Serving is designed to work specifically with Google’s service account keys.\nSolution\nTo continue using a Google AI Studio API key to serve a Gemini model, set up the external model in Mosaic AI Model Serving with the\nCustom Provider\nconfigured in an OpenAI-compatible format.\nUse the following configuration.\nProvider: Custom Provider\r\n\r\nCustom Provider Model URL: https://generativelanguage.googleapis.com/v1beta/openai/chat/completions\r\n\r\nCustom Provider Authentication Type: Bearer Token Authentication\r\n  \r\nBearer Token: {{secrets/scope-name/gemini-api-key}}  # Use secrets!\r\n\r\nModel Name: #gemini-2.0-flash\nFor further information on custom provider setup, refer to the “Custom provider” section of the\nExternal models in Mosaic AI Model Serving\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation. You can also refer to the “Google Cloud Vertex AI” section of the same documentation.\nFor information on Google Gemini OpenAI compatibility, refer to the Google\nOpenAI compatibility\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/gpu-metrics-indicate-that-the-gpu-is-not-being-used-during-model-inference.json b/scraped_kb_articles/gpu-metrics-indicate-that-the-gpu-is-not-being-used-during-model-inference.json new file mode 100644 index 0000000000000000000000000000000000000000..669a2d2cedc42b387cd300361a5aa56d99b56756 --- /dev/null +++ b/scraped_kb_articles/gpu-metrics-indicate-that-the-gpu-is-not-being-used-during-model-inference.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/gpu-metrics-indicate-that-the-gpu-is-not-being-used-during-model-inference", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a PyTorch model that you have already logged and registered in your workspace using MLflow. When loading it using the\nmlflow.pytorch.load_model()\nfunction and passing inputs to perform predictions in your Databricks notebook, you notice after some time that the\nCluster Metrics\npage shows 0% as GPU utilization for the cluster attached to the notebook.\nCause\nYou haven't loaded the model specifying the available GPU device when calling the\nmlflow.pytorch.load_model()\nfunction.\nSolution\nThe device parameter was added to the\nmlflow.pytorch.load_model()\nfunction\non Dec 27, 2023\nto allow the model to be sent to the defined device when loading it. 
You can solve the issue by following this code snippet example.\n# Get cpu or gpu for inference.\r\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\r\n\r\n# Define the Model URI\r\nlogged_model = f\"runs:/{}/model\"\r\nloaded_model = mlflow.pytorch.load_model(model_uri=logged_model, device=device)" +} \ No newline at end of file diff --git a/scraped_kb_articles/granting-select-permissions-to-specific-user-groups-on-a-subset-of-tables-only.json b/scraped_kb_articles/granting-select-permissions-to-specific-user-groups-on-a-subset-of-tables-only.json new file mode 100644 index 0000000000000000000000000000000000000000..440b38f615ad742dc256d6686b4e640a28037261 --- /dev/null +++ b/scraped_kb_articles/granting-select-permissions-to-specific-user-groups-on-a-subset-of-tables-only.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/granting-select-permissions-to-specific-user-groups-on-a-subset-of-tables-only", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Delta Live Table (DLT) pipelines, you want to grant SELECT permissions to specific user groups on a subset of pipeline-owned, dynamically-created streaming tables, rather than all the tables in the DLT pipeline.\nCause\nThe current functionality within Databricks does not support granting permissions directly from within the DLT pipeline for tables that are created dynamically.\nTables are created based on the configuration files, and the permissions need to be managed outside the pipeline.\nSolution\nRun the DLT pipeline once to create the necessary views and tables.\nAfter the first run, manually grant SELECT permissions to the specific user groups on the newly created tables. This can be done using Databricks SQL with the following command.\nGRANT SELECT ON TO ``\nAlternatively, you can use the Catalog UI. 
Refer to the instructions in the\nGrant permissions on objects in a Unity Catalog metastore\nsection of the\nManage privileges in Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nNote\nNote that after you have granted permissions on your pipeline tables, those permissions persist between pipeline runs. You do not need to repeat the steps to set permissions each time." +} \ No newline at end of file diff --git a/scraped_kb_articles/grayed-out-catalogs-appearing-in-a-deleted-unity-catalog-registered-workspace.json b/scraped_kb_articles/grayed-out-catalogs-appearing-in-a-deleted-unity-catalog-registered-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..5eb93613d44e0dad3c7db4f7e12733acd6d9dd4d --- /dev/null +++ b/scraped_kb_articles/grayed-out-catalogs-appearing-in-a-deleted-unity-catalog-registered-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/grayed-out-catalogs-appearing-in-a-deleted-unity-catalog-registered-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAfter an Azure Unity Catalog-registered workspace is deleted with the\n\"Delete default workspace storage permanently\"\noption selected, you try to access catalogs created in that workspace from another workspace.\nYou also notice:\nThese catalogs are grayed out in the catalog explorer.\nIt is not possible to change the ownership of the affected catalogs.\nRunning the\nSHOW CATALOGS;\ncommand as a metastore admin does not list or return the affected catalogs.\nCause\nPermanent deletion of workspace storage can lead to orphaned catalog entries in the Unity Catalog metastore. 
The catalogs lose their associated workspace context but remain in the system, now partially registered.\nSolution\nForce delete the catalogs in your metastore or bind the catalog(s) to a new workspace.\nForce delete a catalog\nImportant\nForce deletion requires both metastore owner or admin privileges for the metastore where the catalog resides, and workspace admin on the workspace where you see the grayed out catalogs.\nIf you are already an account admin, you can make your account a metastore admin. Review the “Assign a metastore admin” section of the\nAdmin privileges in Unity Catalog\ndocumentation.\nDownload the Databricks CLI to your local machine and authenticate your workspace. Instructions are available in the\nInstall or update the Databricks CLI\ndocumentation and the\nAuthentication for the Databricks CLI\ndocumentation.\nRun the following command.\ndatabricks unity-catalog catalogs delete --name  --force\nBind a catalog to another workspace\nIf you want to keep a catalog and assign it to another workspace, a metastore admin or the catalog owner can bind the catalog to this other workspace.\nReview the “Bind a catalog to one or more workspaces” section of the\nLimit catalog access to specific workspaces\ndocumentation for instructions." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/h2o-cluster-not-reachable-exception.json b/scraped_kb_articles/h2o-cluster-not-reachable-exception.json new file mode 100644 index 0000000000000000000000000000000000000000..c0e9c7ed90ccc43f1079386bbd2d08e24169ecb9 --- /dev/null +++ b/scraped_kb_articles/h2o-cluster-not-reachable-exception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/h2o-cluster-not-reachable-exception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to initialize\nH2O.ai\n’s\nSparkling Water\non Databricks Runtime 7.0 and above when you get an\nH2OClusterNotReachableException\nerror message.\n%scala\r\n\r\nimport ai.h2o.sparkling._\r\nval h2oContext = H2OContext.getOrCreate()\nai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-root_app-20210720231748-0000 is not reachable.\nCause\nThis error occurs when you are trying to use a version of the Sparkling Water package which is not compatible with the version of Apache Spark used on your Databricks cluster.\nSolution\nMake sure you are downloading the correct version of Sparkling Water from the\nSparkling Water download\npage.\nBy default, the download page provides the latest version of Sparkling Water. If you are still having trouble, you may want to try rolling back to a prior version of Sparkling Water that is compatible with your Spark version.\nIf you are still having trouble configuring Sparkling Water, open a case with\nH2O.ai support\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/handling-case-sensitivity-issues-in-delta-lake-nested-fields.json b/scraped_kb_articles/handling-case-sensitivity-issues-in-delta-lake-nested-fields.json new file mode 100644 index 0000000000000000000000000000000000000000..fa8f7c150ad76008bf2f4995ce8c380b85cee907 --- /dev/null +++ b/scraped_kb_articles/handling-case-sensitivity-issues-in-delta-lake-nested-fields.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/handling-case-sensitivity-issues-in-delta-lake-nested-fields", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nApache Spark streaming jobs in Delta Lake may fail with errors indicating that the input schema contains nested fields that are capitalized differently than the target table.\n[DELTA_NESTED_FIELDS_NEED_RENAME]\nThe input schema contains nested fields that are capitalized differently than the target table. They need to be renamed to avoid the loss of data in these fields while writing to Delta.\nThis error is distinct from Spark's generally case-insensitive handling of data columns.\nNote\nThis article applies to Databricks Runtime 14.3 and below.\nCause\nWhile top-level fields in Delta Lake are case insensitive, nested fields must match the case exactly as defined in the table schema.\nSolution\nSet the following property in your Spark configuration, which automatically corrects the case of nested field names to match the target table's schema.\nspark.conf.set(\"spark.databricks.delta.nestedFieldNormalizationPolicy\", \"cast\")\nFor further information, please review the\nError classes in Databricks\n(\nAWS\n|\nAzure\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/handling-data-duplication-issues-with-databricks-autoloader-and-delta-lake-using-replacewhere.json b/scraped_kb_articles/handling-data-duplication-issues-with-databricks-autoloader-and-delta-lake-using-replacewhere.json new file mode 100644 index 0000000000000000000000000000000000000000..f167b9f82fe64848d35bcc6a51022ea3e5675ec0 --- /dev/null +++ b/scraped_kb_articles/handling-data-duplication-issues-with-databricks-autoloader-and-delta-lake-using-replacewhere.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/handling-data-duplication-issues-with-databricks-autoloader-and-delta-lake-using-replacewhere", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using\nreplaceWhere\nduring data ingestion to overwrite specific data partitions in a Delta table, you notice that new data are appended to, instead of replacing, old data, causing duplicates.\nCause\nThe\nreplaceWhere\noption is intended to be used during the write operation, not the read operation. When used during the read operation, it produces the problem described.\nSolution\nModify your Delta Lake pipeline to include the\nreplaceWhere\noption during the write operation.\nExample\nIn this example, the\nreplaceWhere\noption is used to atomically replace all records in the month of January 2017 in the target table with the data in\nreplace_data.\nreplace_data.write \\\r\n.mode(\"overwrite\") \\\r\n.option(\"replaceWhere\", \"start_date >= '2017-01-01' AND end_date <= '2017-01-31'\") \\\r\n.save(\"/tmp/delta/events\")\nIf you are using Delta Live Tables (DLT), use Data Manipulation Language (DML) to remove or drop duplicates. 
You can use the\nINSERT INTO REPLACE WHERE\nstatement on your target streaming table to eliminate duplicates.\nINSERT INTO target_table\r\nREPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'" +} \ No newline at end of file diff --git a/scraped_kb_articles/handling-warn-message-could-not-turn-on-cdf-for-table-xxx-in-delta-live-tables-pipeline.json b/scraped_kb_articles/handling-warn-message-could-not-turn-on-cdf-for-table-xxx-in-delta-live-tables-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..ccdc69603fedcf7ad0eb5c8ee9a777eb44b04856 --- /dev/null +++ b/scraped_kb_articles/handling-warn-message-could-not-turn-on-cdf-for-table-xxx-in-delta-live-tables-pipeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/handling-warn-message-could-not-turn-on-cdf-for-table-xxx-in-delta-live-tables-pipeline", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile running the Databricks Delta Live Tables (DLT) pipeline, you encounter a WARN message in DLT event logs.\nCould not turn on CDF for table . The table contains reserved columns  [_change_type, _commit_version, _commit_timestamp] that will be used internally as metadata for the table's Change Data Feed. Change Data Feed is required for certain features in DLT. If you wish to turn it on, please rename/drop the reserved columns.\nCause\nBy default, DLT creates all tables with Change Data Feed (CDF) enabled. When reading source data with CDF enabled (\nreadChangeFeed=true\n), the source DataFrame includes reserved columns like\n[_change_type, _commit_version, _commit_timestamp]\n.\nWhen DLT attempts to create the target table with CDF enabled (with reserved columns), it throws a WARN message due to the ambiguity of these columns. 
Consequently, DLT falls back and creates the table without CDF enabled.\nThis WARN message does not fail the pipeline but indicates that the table creation process with CDF enabled encountered an issue with reserved columns.\nSolution\nTo avoid a WARN message, first use the\nexcept_column_list\nparameter inside\ndlt.apply_changes()\nto exclude the reserved columns.\nexcept_column_list = [\"_change_type\", \"_commit_version\", \"_commit_timestamp\"]\nThen, for append-only DLT streaming tables, drop the reserved columns.\n@dlt.table(\r\n    name=\"\"\r\n)\r\ndef table():\r\n    exclude_columns = [\"_change_type\", \"_commit_version\", \"_commit_timestamp\"]\r\n    df = spark.readStream.format(\"delta\").option(\"readChangeFeed\",\"true\").table(\"\")\r\n    return df.drop(*exclude_columns)\nFor more information, please review the\nThe APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/hdfs-to-read-files.json b/scraped_kb_articles/hdfs-to-read-files.json new file mode 100644 index 0000000000000000000000000000000000000000..dc38e5b0a46170652233f61f84275133182053df --- /dev/null +++ b/scraped_kb_articles/hdfs-to-read-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/hdfs-to-read-files", + "title": "Título do Artigo Desconhecido", + "content": "There may be times when you want to read files directly without using third party libraries. 
This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts.\nAWS\nUse the following example code for S3 bucket storage.\n%python\r\n\r\nURI = sc._gateway.jvm.java.net.URI\r\nPath = sc._gateway.jvm.org.apache.hadoop.fs.Path\r\nFileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem\r\nconf = sc._jsc.hadoopConfiguration()\r\nfs = Path('s3a:///').getFileSystem(sc._jsc.hadoopConfiguration())\r\nistream = fs.open(Path('s3a:///'))\r\n\r\nreader = sc._gateway.jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))\r\n\r\nwhile True:\r\n  thisLine = reader.readLine()\r\n  if thisLine is not None:\r\n    print(thisLine)\r\n  else:\r\n    break\r\n\r\nistream.close()\nwhere\n\nis the name of the S3 bucket.\n\nis the full path to the file.\nAzure\nUse the following example code for Azure Blob storage.\n%python\r\n\r\nURI = sc._gateway.jvm.java.net.URI\r\nPath = sc._gateway.jvm.org.apache.hadoop.fs.Path\r\nFileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem\r\nconf = sc._jsc.hadoopConfiguration()\r\n\r\nconf.set(\r\n  \"fs.azure.account.key..blob.core.windows.net\",\r\n  \"\")\r\n\r\nfs = Path('wasbs://@.blob.core.windows.net//').getFileSystem(sc._jsc.hadoopConfiguration())\r\nistream = fs.open(Path('wasbs://@.blob.core.windows.net//'))\r\n\r\nreader = sc._gateway.jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))\r\n\r\nwhile True:\r\n  thisLine = reader.readLine()\r\n  if thisLine is not None:\r\n    print(thisLine)\r\n  else:\r\n    break\r\n\r\nistream.close()\nwhere\n\nis your Azure account name.\n\nis the container name.\n\nis the full path to the file.\n\nis the account access key." +} \ No newline at end of file diff --git a/scraped_kb_articles/hive-metastore-troubleshooting.json b/scraped_kb_articles/hive-metastore-troubleshooting.json new file mode 100644 index 
0000000000000000000000000000000000000000..e4bdbddf30b4ab6502e71e23afa39ff7f23344eb --- /dev/null +++ b/scraped_kb_articles/hive-metastore-troubleshooting.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/hive-metastore-troubleshooting", + "title": "Título do Artigo Desconhecido", + "content": "Problem 1: External metastore tables not available\nWhen you inspect the driver logs, you see a stack trace that includes the error\nRequired table missing:\nWARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no possible candidates\r\n\r\nRequired table missing: \"DBS\" in Catalog \"\" Schema \"\". DataNucleus requires this table to perform its\r\npersistence operations. Either your MetaData is incorrect, or you need to enable\r\n\"datanucleus.schema.autoCreateTables\"\r\n\r\norg.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : \"DBS\" in Catalog \"\"  Schema \"\". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable\r\n\"datanucleus.schema.autoCreateTables\"\r\n\r\n   at\r\n\r\norg.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)\r\n\r\n   at\r\n\r\norg.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:33\r\n85)\nCause\nThe database is present, but there are no metastore tables.\nSolution\nIf the external metastore version is Hive 2.0 or above, use the\nHive Schema Tool\nto create the metastore tables. 
For versions below Hive 2.0, add the metastore tables with the following configurations in your existing init script:\nspark.hadoop.datanucleus.autoCreateSchema=true\r\nspark.hadoop.datanucleus.fixedDatastore=false\nYou can also set these configurations in the Apache\nSpark config\n(\nAWS\n|\nAzure\n) directly:\ndatanucleus.autoCreateSchema true\r\ndatanucleus.fixedDatastore false\nProblem 2: Hive metastore verification failed\nWhen you inspect the driver logs, you see a stack trace that includes an error like the following:\n18/09/24 14:51:07 ERROR RetryingHMSHandler: HMSHandler Fatal error:\r\nMetaException(message:Version information not found in metastore. )\r\n\r\n   at\r\norg.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore\r\n.java:7564)\r\n\r\n   at\r\norg.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.\r\njava:7542)\r\n\r\n   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nCause\nThe\nVERSION\ntable in the metastore is empty.\nSolution\nDo one of the following:\nPopulate the\nVERSION\ntable with the correct version values using an\nINSERT\nquery.\nSet the following configurations to turn off the metastore verification in the Spark configuration of the cluster:\nhive.metastore.schema.verification false\r\nhive.metastore.schema.verification.record.version false\nProblem 3: Metastore connection limit exceeded\nCommands run on the cluster fail with the following stack trace in the driver logs:\nUnable to open a test connection to the given\r\ndatabase. JDBC url =\r\njdbc:?trustServerCertificate=true&useSS\r\nL=true, username = . 
Terminating\r\nconnection pool (set lazyInit to true if you\r\nexpect to start your database after your app).\r\nOriginal Exception: ------\r\n\r\njava.sql.SQLSyntaxErrorException: User\r\n'' has exceeded the\r\n'max_user_connections' resource (current value:\r\n100)\r\nat\r\norg.mariadb.jdbc.internal.util.exceptions.Except\r\nionMapper.get(ExceptionMapper.java:163)\r\nat\r\norg.mariadb.jdbc.internal.util.exceptions.Except\r\nionMapper.getException(ExceptionMapper.java:106)\r\nat\r\norg.mariadb.jdbc.internal.protocol.AbstractConne\r\nctProtocol.connectWithoutProxy(AbstractConnectPr\r\notocol.java:1036)\nCause\nThe metastore configuration allows only 100 connections. When the connection limit is reached, new connections are not allowed, and commands fail with this error. Each cluster in the Databricks workspace establishes a connection with the metastore. If you have a large number of clusters running, then this issue can occur. Additionally, incorrect configurations can cause a connection leak, causing the number of connections to keep increasing until the limit is reached.\nSolution\nCorrect the problem with one of the following actions:\nIf you are using an external metastore and you have a large number of clusters running, then increase the connection limit on your external metastore.\nIf you are not using an external metastore, ensure that you do not have any custom Hive metastore configurations on your cluster. When using the metastore provided by Databricks, you should use the default configurations on the cluster for the Hive metastore.\nIf you are using the default configuration and still encounter this issue, contact Databricks Support. 
Depending on the configuration of your Databricks workspace, it might be possible to increase the number of connections allowed to the internal metastore.\nProblem 4: Table actions fail because column has too much metadata\nWhen the quantity of metadata for a single column exceeds 4000 characters, table actions fail with an error like this:\nError in SQL statement: IllegalArgumentException:\r\nError: type expected at the position 3998 of 'struct\r\n        // Use Scala string interpolation. It's the easiest way, and it's\r\n        // type-safe, unlike String.format().\r\n        f\"0x${num.get}%x\"\r\n    }\r\n    .getOrElse(\"\")\r\n  }\r\n}\nRegister the function:\n%scala\r\n\r\nspark.sql(\"CREATE TEMPORARY FUNCTION to_hex AS 'com.ardentex.spark.hiveudf.ToHex'\")\nUse your function as any other registered function:\n%scala\r\n\r\nspark.sql(\"SELECT first_name, to_hex(code) as hex_code FROM people\")\nYou can find more examples and compilable code at the\nSample Hive UDF project\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-bulk-assign-permissions-to-catalogs.json b/scraped_kb_articles/how-to-bulk-assign-permissions-to-catalogs.json new file mode 100644 index 0000000000000000000000000000000000000000..bac30a4bd653c98f983546d264007c481adb088f --- /dev/null +++ b/scraped_kb_articles/how-to-bulk-assign-permissions-to-catalogs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-bulk-assign-permissions-to-catalogs", + "title": "Título do Artigo Desconhecido", + "content": "How-to Introduction\nYou have several catalogs in your environment and want to delete the catalogs you don't need. 
Currently to delete a catalog, you must grant ALL privileges to a user or group on specific catalogs one at a time in the UI.\nYou want a way to bulk-assign permission to multiple catalogs at once.\nInstructions\nIn Unity Catalog, it is not possible to directly grant or revoke ALL privileges on all, or multiple, catalogs with a single query due to security and granularity considerations.\nInstead, you can use Python to get a catalog list and loop through each catalog explicitly to apply GRANT statements, and then programmatically delete catalogs one at a time.\nFirst, run the following code to gather and store your catalog names in a Python list.\ncatalogs = spark.sql(\"show catalogs\").collect()\r\ndisplay(catalogs)\r\n\r\n# Convert the collected results to a Python list\r\ncatalog_list = [row['catalog'] for row in catalogs]\r\n\r\n# Print the list\r\nprint(catalog_list)\nThen use the following script to loop through the catalogs and give a user access as required to then delete. This code grants ALL PRIVILEGES access, but you can adjust to the level of access you choose. 
The code then drops each catalog individually, which you now have permission to do.\ncatalog_list = ['', '']\r\n \r\nfor catalog_name in catalog_list:\r\n   if catalog_name not in [\"samples\"]: # Exclude sample catalogs\r\n    grant_query = f\"GRANT ALL PRIVILEGES ON CATALOG {catalog_name} TO ``\"\r\n    spark.sql(grant_query)\r\n    print(f\"Permissions granted to {catalog_name} successfully.\")\r\n\r\n    drop_query = f\"DROP CATALOG IF EXISTS {catalog_name} CASCADE\"\r\n    spark.sql(drop_query)\r\n    print(f\"Catalog {catalog_name} dropped successfully.\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-configure-a-compute-policy-to-include-databricks-runtime-versions-using-ml-compute.json b/scraped_kb_articles/how-to-configure-a-compute-policy-to-include-databricks-runtime-versions-using-ml-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..b66bb5f6be484c6c236f8c9acc0cfde4a7d0a8f4 --- /dev/null +++ b/scraped_kb_articles/how-to-configure-a-compute-policy-to-include-databricks-runtime-versions-using-ml-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-configure-a-compute-policy-to-include-databricks-runtime-versions-using-ml-compute", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou're using a compute policy to restrict the Databricks Runtime version for your clusters. 
You set your current configuration for standard computes to allow the following Apache Spark versions.\n\"spark_version\": {\r\n\"defaultValue\": \"15.4.x-scala2.12\",\r\n\"type\": \"allowlist\",\r\n\"values\": [\r\n\"13.3.x-scala2.12\",\r\n\"14.3.x-scala2.12\",\r\n\"15.4.x-scala2.12\"\r\n]}\nHowever, you now want to extend this configuration to support ML computes, both standard and those with GPUs, and you're unsure about the correct Spark version values to use for ML computes.\nInstructions\nML compute requires a Databricks Runtime version that includes ML support, which is typically denoted by a different naming convention compared to the standard Spark runtime versions. For instance, ML runtimes are usually marked with an ML suffix such as\n15.4.x-cpu-ml-scala2.12\nor\n15.4.x-gpu-ml-scala2.12\nfor GPU-enabled ML computes.\nTo extend your compute policy to support both standard and GPU ML computes, include the appropriate Databricks Runtime ML versions in your\nspark_version\nallowlist.\n1. Identify the ML runtime versions that correspond to the Spark versions you're currently using. For example, if you use\n13.3.x-scala2.12\nthen look for\n13.3.x-cpu-ml-scala2.12\n(Standard) and\n13.3.x-gpu-ml-scala2.12\n(GPU).\n2. Modify your\nspark_version\nconfiguration to include both the standard Spark runtime versions and their corresponding ML versions. 
The following code provides an example.\n\"spark_version\": {\r\n  \"type\": \"allowlist\",\r\n  \"values\": [\r\n    \"13.3.x-scala2.12\",\r\n    \"13.3.x-cpu-ml-scala2.12\",\r\n    \"13.3.x-gpu-ml-scala2.12\",\r\n    \"14.3.x-scala2.12\",\r\n    \"14.3.x-cpu-ml-scala2.12\",\r\n    \"14.3.x-gpu-ml-scala2.12\",\r\n    \"15.4.x-scala2.12\",\r\n    \"15.4.x-cpu-ml-scala2.12\",\r\n    \"15.4.x-gpu-ml-scala2.12\"\r\n  ],\r\n  \"defaultValue\": \"15.4.x-cpu-ml-scala2.12\" # Or any other suitable default\r\n}\nIf needed, you can run the Python code below in a notebook to list all the Spark version values available and able to be used in your compute policy.\nfrom databricks.sdk import WorkspaceClient\r\n\r\nw = WorkspaceClient()\r\nruntimes = w.clusters.spark_versions().versions\r\n\r\nfor r in runtimes:\r\n    print(r.key)" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-configure-a-service-principal-with-git-credentials-for-jobs-with-git-source.json b/scraped_kb_articles/how-to-configure-a-service-principal-with-git-credentials-for-jobs-with-git-source.json new file mode 100644 index 0000000000000000000000000000000000000000..ef5081612236129969aef01a199eede9629557ef --- /dev/null +++ b/scraped_kb_articles/how-to-configure-a-service-principal-with-git-credentials-for-jobs-with-git-source.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/how-to-configure-a-service-principal-with-git-credentials-for-jobs-with-git-source", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou may require your Databricks jobs to access notebooks stored in external Git providers, such as GitHub or Azure DevOps, using a non-interactive service principal. 
You need to ensure the service principal can seamlessly authenticate and retrieve the necessary Git credentials from the API to avoid job failure due to missing credentials.\nInstructions\nWhen you manage notebooks in an external Git repository, you must give the Databricks environment access to those Git resources. By default, Databricks jobs use a user-based token or interactive authentication to clone or fetch code from Git. For jobs run programmatically using service principals, you need to provide the service principal with valid credentials to authenticate against the Git provider.\nFirst, generate a Personal Access Token (PAT) in your Git provider with appropriate scopes to access the repository.\nNext, create an on-behalf-of (OBO) token for your service principal. Use endpoint\nPOST /api/2.0/token-management/on-behalf-of/tokens\nand make a request.\nYou can modify and use the following code in a notebook using an interactive compute cluster or from your local terminal.\n%sh\r\n# Set up the environment variables\r\nexport DATABRICKS_WORKSPACE_URL=\"\"\r\nexport OBO_TOKEN=\"\"\r\n\r\ncurl -X POST https:///api/2.0/token-management/on-behalf-of/tokens \\\r\n  -H \"Authorization: Bearer \" \\\r\n  -H \"Content-Type: application/json\" \\\r\n  -d '{\r\n        \"application_id\": \"\",\r\n        \"comment\": \"...\",\r\n        \"lifetime_seconds\": 360000\r\n      }'\nFor details, refer to the\nCreate on-behalf token\n(\nAWS\n|\nGCP\n) API documentation.\nNote\nIf you use Azure, Databricks recommends using Microsoft Entra ID (formerly Azure Active Directory or AAD) authentication for service principals. Alternatively, you can create a PAT token for a service principal. For details on how to accomplish both, refer to the “Manage tokens for a service principal” of the\nManage service principals\ndocumentation.\nThe request returns an API token in response. 
The following is an example response.\n{\r\n  \"token_value\": \"\", \r\n  \"token_info\": {\r\n    \"token_id\": \"\",\r\n    \"creation_time\":  ,\r\n    \"expiry_time\": ,\r\n    \"comment\": \"...\",\r\n    \"created_by_id\": ,\r\n    \"created_by_username\": \"\",\r\n    \"owner_id\": \r\n  }\r\n}\nThen, use the returned token in\n\"token_value\"\nto create Git credentials through an API call. You can modify and run the following commands in a notebook using an interactive compute cluster.\n%sh\r\n# Set up the environment variables\r\nexport DATABRICKS_WORKSPACE_URL=\"\"\r\nexport OBO_TOKEN=\"\"\r\n\r\n\r\n# Perform the POST request\r\n\r\n\r\ncurl --location --request POST \"${DATABRICKS_WORKSPACE_URL}/api/2.0/git-credentials\" \\\r\n--header \"Authorization: Bearer ${OBO_TOKEN}\" \\\r\n--header \"Content-Type: application/json\" \\\r\n--data-raw '{\r\n \"git_provider\": \"\",\r\n \"personal_access_token\": \"\",\r\n \"git_username\": \"\"\r\n}'\nNote\nYou can also save the value of\n\"token_value\"\nin a Databricks secret for safety and reference it in the code. For more information, refer to the\nSecret management\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFinally, when setting up the job in Databricks, specify your service principal as the job’s run-as user. The job should now have access to the Git credentials. For more information, refer to the\nCreate a credential entry\n(\nAWS\n|\nAzure\n|\nGCP\n) API documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-create-alerts-for-expiring-personal-access-tokens-pats.json b/scraped_kb_articles/how-to-create-alerts-for-expiring-personal-access-tokens-pats.json new file mode 100644 index 0000000000000000000000000000000000000000..cfeca282d1be68f62c3f3a1bae703829225182ed --- /dev/null +++ b/scraped_kb_articles/how-to-create-alerts-for-expiring-personal-access-tokens-pats.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/how-to-create-alerts-for-expiring-personal-access-tokens-pats", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou want to create a Databricks function to alert you when created personal access tokens (PATs) are about to expire.\nInstructions\nYou can create a custom function. Use Databricks system tables and SQL alerts to set alerts when a PAT is about to expire.\nFirst get the expiry date of any PATs using the following query. Make sure to save the SQL query through the SQL editor UI. Adjust\ncurrent_date()+X\nto the number of days in the future the PATs are scheduled to expire. For example,\n+3\nreturns any PAT due to expire in three days.\nSELECT from_unixtime(request_params.tokenExpirationTime / 1000) as expiry_date\r\nFROM system.access.audit\r\nWHERE \r\n service_name = 'accounts'\r\n AND action_name = 'generateDbToken'\r\n AND to_date(from_unixtime(request_params.tokenExpirationTime / 1000)) = current_date()+X;\nThe following screenshot shows the code with\n+3\n, and the output of one column with expiration dates.\nThen, to create a notification, leverage SQL alerts. Set a trigger condition according to your use case. For details about SQL alerts, refer to the “Create an alert” section of the\nDatabricks SQL alerts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-create-an-init-script-to-collect-tcp_dumps.json b/scraped_kb_articles/how-to-create-an-init-script-to-collect-tcp_dumps.json new file mode 100644 index 0000000000000000000000000000000000000000..e4d7400dd507fc48b0e8f6150cf0f0cdabf50f1f --- /dev/null +++ b/scraped_kb_articles/how-to-create-an-init-script-to-collect-tcp_dumps.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/how-to-create-an-init-script-to-collect-tcp_dumps", + "title": "Título do Artigo Desconhecido", + "content": "How to collect tcp_dumps\nCreate a path and the init script, add the init script to an allowlist, configure the init script, then locate and download the pcap files.\nCreate a volume or workspace and then the tcp_dump init script\nCreate a volume or workspace and provide the path to store the init script.\nRun the following sample init script in a notebook to collect\ntcp_dumps\n. This script uses curl to make a DBFS\nPUT\nAPI call to the workspace to upload the pcap files. You need to pass the token and the workspace host in the API path in the script. Your DAPI token should be an existing valid PAT (string) that has permission to use DBFS PUT API.\nIf you want to filter the\ntcp_dumps\nwith host and port, you can uncomment the\nTCPDUMP_FILTER\nline in the script and add the required host and port. Alternatively, you can pass the host and port separately, depending on your requirements.\nThe following code references a volume path. For workspace paths, change\nVolumes\nto\nWorkspace\n.\ndbutils.fs.put(\"/Volumes//tcp_dumps.sh\", \"\"\"\r\n#!/bin/bash\r\n\r\nset -euxo pipefail\r\n\r\nMYIP=$(echo $HOSTNAME)\r\nTMP_DIR=\"/local_disk0/tmp/tcpdump\"\r\n\r\n[[ ! 
-d ${TMP_DIR} ]] && mkdir -p ${TMP_DIR}\r\nTCPDUMP_WRITER=\"-w ${TMP_DIR}/trace_%Y%m%d_%H%M%S_${DB_CLUSTER_ID}_${MYIP}.pcap -W 1000 -G 900 -Z root -U -s256\"\r\nTCPDUMP_PARAMS=\"-nvv -K\"\r\n#TCPDUMP_FILTER=\"host xxxxxxxxx.dfs.core.windows.net and port 443\" ## add host/port filter here based on the requirement\r\n\r\nsudo tcpdump $(echo \"${TCPDUMP_WRITER}\") $(echo \"${TCPDUMP_PARAMS}\") $(echo \"${TCPDUMP_FILTER}\") &\r\necho \"Started tcpdump $(echo \"${TCPDUMP_WRITER}\") $(echo \"${TCPDUMP_PARAMS}\") $(echo \"${TCPDUMP_FILTER}\")\"\r\n\r\ncat > /tmp/copy_stats.sh << 'EOF'\r\n#!/bin/bash\r\n\r\nTMP_DIR=$1\r\nDB_CLUSTER_ID=$2\r\nCOPY_INTERVAL_IN_SEC=45\r\nMYIP=$(echo $HOSTNAME)\r\necho \"Starting copy script at `date`\"\r\n\r\nDEST_DIR=\"/Volumes/main/default/jar/\"\r\n#mkdir -p ${DEST_DIR}\r\n\r\nsleep_duration=45\r\n\r\nlog_file=\"/tmp/copy_stats.log\"\r\ntouch $log_file\r\n\r\ndeclare -gA file_sizes\r\n\r\n## logic to copy  files by checking previous size. Uses associative array to persist rotated files size.\r\n\r\nwhile true; do\r\n sleep ${COPY_INTERVAL_IN_SEC}\r\n #ls -ltr ${DEST_DIR} >  $log_file\r\n for file in $(find \"$TMP_DIR\" -type f -mmin -3 ); do\r\n   current_size=$(stat -c \"%s\" \"$file\")\r\n   file_name=$(basename \"$file\")\r\n   last_size=${file_sizes[\"$file_name\"]}\r\n   if [ \"$current_size\" != \"$last_size\" ]; then\r\n       echo \"Copying $file with current size: $current_size and last size: $last_size at `date`\" | tee -a $log_file\r\n       DBFS_PATH=\"dbfs:/FileStore/tcpdumpfolder/${DB_CLUSTER_ID}/trace_$(date +\"%Y-%m-%d--%H-%M-%S\")_${DB_CLUSTER_ID}_${MYIP}.pcap\"\r\n       curl -vvv -F contents=@$file -F path=\"$DBFS_PATH\" -H \"Authorization: Bearer \" https:///api/2.0/dbfs/put  2>&1 | tee -a $log_file\r\n       #cp --verbose \"$file\" \"$DEST_DIR\" | tee -a $log_file\r\n       echo \"done Copying $file with current size: $current_size at `date`\" | tee -a $log_file\r\n\r\n       file_sizes[$file_name]=$current_size\r\n   
else\r\n       echo \"Skip Copying $file with current size: $current_size and last size: $last_size at `date`\" | tee -a $log_file\r\n   fi\r\n done\r\n done\r\n\r\nEOF\r\n\r\nchmod a+x /tmp/copy_stats.sh\r\n/tmp/copy_stats.sh $TMP_DIR $DB_CLUSTER_ID & disown              \r\n\"\"\", True)\nNote the path to the init script. You will need it when configuring your cluster.\nAdd the init script to the allowlist\nFollow the instructions to add the init script to the allowlist in the\nAllowlist libraries and init scripts on compute with standard access mode (formerly shared access mode)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nConfigure the init script\nFollow the instructions to configure a cluster-scoped init script in the\nCluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSpecify the path to the init script. Use the same path that you used in the preceding script. (\n/Volumes//tcp_dumps.sh\nor\n/Workspace//tcp_dumps.sh\n)\nAfter configuring the init script, restart the cluster.\nLocate the pcap files\nOnce the cluster has started, it automatically starts creating pcap files containing the recorded network information. Locate the pcap files in the folder\ndbfs:/FileStore/tcpdumpfolder/${DB_CLUSTER_ID}\n.\nDownload the pcap files\nDownload the pcap files from the DBFS path to your local host for analysis. There are multiple ways to download files to your local machine. One option is the Databricks CLI. For more information, review the\nWhat is the Databricks CLI?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-disable-jdbc-on-all-purpose-compute.json b/scraped_kb_articles/how-to-disable-jdbc-on-all-purpose-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..6bc3761eceede55dfec5bdb04b15d0aaba56aba3 --- /dev/null +++ b/scraped_kb_articles/how-to-disable-jdbc-on-all-purpose-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-disable-jdbc-on-all-purpose-compute", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to disable the JDBC connection on an all-purpose compute but don’t see an option in the UI to do so.\nInstructions\nAdd the following init script to the all-purpose compute to disable the JDBC connection.\n#!/bin/bash\r\ncat > /databricks/common/conf/disable_jdbc_odbc.conf << EOL\r\n{\r\n databricks.daemon.driver.startJDBC = false\r\n}\r\nEOL\nYou may see a 502 error when you disable JDBC. This error is expected, and confirms your init script is working as intended.\nIf you attempt to connect from a local machine using a Java file and the Databricks JDBC driver, a similar error is returned.\nCommunication link failure. Failed to connect to server. Reason: HTTP Response code: 502, Error message: Unknown." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications.json b/scraped_kb_articles/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications.json new file mode 100644 index 0000000000000000000000000000000000000000..82bca983b085b65a0e3ed632ef3ad28e6f095667 --- /dev/null +++ b/scraped_kb_articles/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications", + "title": "Unknown Article Title", + "content": "To prevent the indefinite growth of your State Store (even when the watermark is updated), you can improve how efficiently you manage the lifecycle of your state store files in\nApache Spark Structured Streaming applications\n.\nThis applies to both Hadoop Distributed File System and RocksDB-based providers.\nHandling instructions\nIn any Stateful Streaming application, the streaming engine creates and manages two types of state files. You can find these in your checkpoint state folder:\n*.delta\nand\n*.snapshot\nfiles. Let's examine their lifecycle:\nCreation\nThe streaming engine creates these files as part of the state management process. The\n.delta\nfiles are created for every batch of data processed by the application, while the\n.snapshot\nfiles are generated periodically to provide a consolidated view of the state at a given point in time. This mechanism ensures efficient state recovery in case of application failure.\nDeletion\nBackground maintenance threads running on Spark Executors manage file deletion. 
These threads are responsible for periodically deleting these files, which helps in managing the lifecycle of the state store files.\nConfiguration instructions\nThe\nspark.sql.streaming.minBatchesToRetain\nconfiguration sets the number of delta files retained in the checkpoint location. By default, this is set to\n100\n, but you can adjust this number based on your specific requirements. Reducing this number will result in fewer files being retained in your checkpoint location.\nThe\nspark.sql.streaming.stateStore.maintenanceInterval\nconfiguration sets the interval between triggering maintenance tasks in the StateStore. These maintenance tasks are executed as background tasks and play a crucial role in managing the lifecycle of the state store files. They can also impact the performance of the Spark Streaming applications. Under normal situations, the default interval is sufficient.\nPlease note that these configurations should be carefully evaluated and adjusted based on the specific requirements and constraints of your Spark Streaming applications." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-fetch-databricks-resource-permissions.json b/scraped_kb_articles/how-to-fetch-databricks-resource-permissions.json new file mode 100644 index 0000000000000000000000000000000000000000..dac6cac5e841e6093e268b3b74818f1abebcf09c --- /dev/null +++ b/scraped_kb_articles/how-to-fetch-databricks-resource-permissions.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/how-to-fetch-databricks-resource-permissions", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to understand the permissions of a user/group for all the resources on a workspace level. 
You can’t get them in one go and have to browse the permissions individually on resources.\nCause\nWhen you want to remove a user or a group from a workspace, you want to make sure there are no significant resources that are associated with these users or groups before deleting them. This is because if a user or group owns a resource, deleting it without changing the ownership would impact the functionality.\nSolution\nYou can fetch the permissions of the concerned user or group programmatically in one go by running an automation script in your workspace.\nThis Python script uses the Databricks REST API to retrieve and display the members of a specified group and their access permissions to SQL warehouses. After authenticating with an API token, it fetches the group's name and members, then lists all SQL warehouses in the workspace. For each warehouse, it checks and prints the permissions granted to either a specified user or a group, helping admins review access rights across the environment.\nExample code (Python)\nReplace the following before running this example code:\n\n(string) - This is your workspace URL. For example, myworkspace.cloud.databricks.com.\n\n(string) - Enter the\nOAuth access token\n(\nAWS\n|\nAzure\n|\nGCP\n) you want to use to make the API call.\n\n(int) - The group ID of the group you want to fetch the permissions for.\n[Optional]\n\n(string) - You can enter a username (example: user@domain.com) for a specific user if you want to fetch the permissions for that user.\nNote\nThis script uses OAuth as an authentication method because this is the Databricks-recommended method. 
The script also works with\nPAT\n(\nAWS\n|\nAzure\n|\nGCP\n).\n%python\r\n\r\nimport requests\r\nimport json\r\n\r\n\r\ninstance = # Replace with your workspace URL\r\napi_token =   # Replace with your API token\r\ngroup_id = #mention your group id\r\nusername = #optionally enter a username if you want to fetch permissions for a user\r\n\r\n\r\n\r\n# Set up request headers\r\nheaders = {\r\n       \"Authorization\": f\"Bearer {api_token}\",\r\n       \"Content-Type\": \"application/json\"\r\n   }\r\n  \r\n# Fetch the list of users from the mentioned group\r\nurl_users_list = f\"https://{instance}/api/2.0/preview/scim/v2/Groups/{group_id}\"\r\nresponse_users_list = requests.get(url_users_list, headers=headers)\r\nusers_list = response_users_list.json()['members']\r\ngroup_name = response_users_list.json()['displayName']\r\nprint(f\"Group Name: {group_name}\")\r\n  \r\n# Print the users in this group\r\nusers = [member['display'] for member in users_list]\r\nprint(\"Members in this group:\")\r\nfor user in users:\r\n   print(f\"Member (User/Group): {user}\")\r\n\r\n\r\ndef list_warehouses_and_permissions(api_token, group_id, username=None):\r\n   # Send the request to fetch the list of warehouses\r\n   url_warehouses_list = f\"https://{instance}/api/2.0/sql/warehouses\"\r\n   response_warehouses_list = requests.get(url_warehouses_list, headers=headers)\r\n   warehouses_response = response_warehouses_list.json()\r\n  \r\n   # Check if warehouses are present in the response\r\n   if \"warehouses\" not in warehouses_response:\r\n       print(\"No warehouses found in the response.\")\r\n       return\r\n  \r\n   warehouses_list = warehouses_response[\"warehouses\"]\r\n  \r\n   # Loop through the warehouse IDs and fetch the permissions for each one\r\n   for warehouse in warehouses_list:\r\n       warehouse_id = warehouse[\"id\"]\r\n       warehouse_name = warehouse[\"name\"]\r\n       #print(f\"\\nWarehouse ID: {warehouse_id}\")\r\n       #print(f\"Warehouse Name: 
{warehouse_name}\")\r\n      \r\n       # Send the request to fetch the permissions for the warehouse\r\n       url_permissions = f\"https://{instance}/api/2.0/permissions/warehouses/{warehouse_id}\"\r\n       response_permissions = requests.get(url_permissions, headers=headers)\r\n      \r\n       # Extract the JSON response\r\n       response_json = response_permissions.json()\r\n      \r\n       # Loop through the access control list and find the matching user_name/group_name\r\n       if 'access_control_list' in response_json:\r\n           for acl in response_json['access_control_list']:\r\n               if ('user_name' in acl and acl['user_name'] == username):\r\n                   print(f\"Access control list for user {username} on warehouse {warehouse_id}:\")\r\n                   print(json.dumps(acl['all_permissions'], indent=4))\r\n               elif ('group_name' in acl and acl['group_name'] == group_name):\r\n                   print(f\"Access control list for group {group_name} on warehouse {warehouse_id}:\")\r\n                   print(json.dumps(acl['all_permissions'], indent=4))\r\n       else:\r\n           print(\"No access control list found in the response.\")\r\n\r\n\r\n# Call the function to list warehouses and fetch permissions for them\r\nlist_warehouses_and_permissions(api_token, group_id, username)\nYou can use this example code as a reference to create a similar script for all other workspace resources. For more information, review the\nPermissions API\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-fetch-the-create-job-json-using-an-api-call-instead-of-the-ui.json b/scraped_kb_articles/how-to-fetch-the-create-job-json-using-an-api-call-instead-of-the-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..cdbe889608d1d3d66985c00f6cd50e4253d2f30c --- /dev/null +++ b/scraped_kb_articles/how-to-fetch-the-create-job-json-using-an-api-call-instead-of-the-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/how-to-fetch-the-create-job-json-using-an-api-call-instead-of-the-ui", + "title": "Unknown Article Title", + "content": "Problem\nYou want to fetch the CREATE job JSON from the\nJobs\nUI tab using an API call instead.\nCause\nThere is no dedicated API endpoint to retrieve the original CREATE job JSON. However, the\nsettings\nfield in the\n/api/2.1/jobs/get\nAPI contains all the necessary fields to reconstruct it.\nSolution\nUse the following example code to construct a GET job JSON file using the\n/api/2.1/jobs/get\nendpoint. The GET job includes the fields from a CREATE job as well as a\n“created_time”\nkey.\n{\r\n\r\n \"job_id\": ,\r\n\r\n \"creator_user_name\": \"\",\r\n\r\n \"run_as_user_name\": \"\",\r\n\r\n \"run_as_owner\": true,\r\n\r\n \"settings\": {\r\n\r\n  \"name\": \"\",\r\n\r\n  \"email_notifications\": {},\r\n\r\n  \"webhook_notifications\": {},\r\n\r\n  \"max_concurrent_runs\": 1,\r\n\r\n  \"tasks\": [\r\n\r\n   {\r\n\r\n    \"task_key\": \"\",\r\n\r\n    \"run_if\": \"ALL_SUCCESS\",\r\n\r\n    \"notebook_task\": {\r\n\r\n     \"notebook_path\": \"/Users//\",\r\n\r\n     \"source\": \"WORKSPACE\"\r\n\r\n    },\r\n\r\n    \"existing_cluster_id\": \"\",\r\n\r\n    \"timeout_seconds\": 0\r\n\r\n   }\r\n\r\n  ],\r\n\r\n  \"format\": \"MULTI_TASK\"\r\n\r\n },\r\n\r\n \"created_time\": 1742198198666\r\n\r\n}\nAdditional information\nTo compare outputs, the following code is a CREATE job JSON for the same job. 
Only the\n“created_time”\nkey is not included.\n{\r\n\r\n \"name\": \"\",\r\n\r\n \"email_notifications\": {},\r\n\r\n \"webhook_notifications\": {},\r\n\r\n \"max_concurrent_runs\": 1,\r\n\r\n \"tasks\": [\r\n\r\n  {\r\n\r\n   \"task_key\": \"\",\r\n\r\n   \"run_if\": \"ALL_SUCCESS\",\r\n\r\n   \"notebook_task\": {\r\n\r\n    \"notebook_path\": \"/Users//\",\r\n\r\n    \"source\": \"WORKSPACE\"\r\n\r\n   },\r\n\r\n   \"existing_cluster_id\": ,\r\n\r\n   \"timeout_seconds\": 0\r\n\r\n  }\r\n\r\n ],\r\n\r\n \"run_as\": {\r\n\r\n  \"user_name\": \"\"\r\n\r\n }\r\n\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-find-delta-table-name-based-on-table-id.json b/scraped_kb_articles/how-to-find-delta-table-name-based-on-table-id.json new file mode 100644 index 0000000000000000000000000000000000000000..ae2f618b861881e94d21b6e0d341af7721688023 --- /dev/null +++ b/scraped_kb_articles/how-to-find-delta-table-name-based-on-table-id.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-find-delta-table-name-based-on-table-id", + "title": "Unknown Article Title", + "content": "Problem\nWhen an Apache Spark job using Unity Catalog tables fails, it throws an error message with the Delta table ID, but you also need the table name.\nError in handleErrors(returnStatus, conn) :\r\n  org.apache.spark.sql.AnalysisException: <>  detected when writing to the Delta table (Table ID: ).\nCause\nDisplaying only the Delta table ID is by design.\nSolution\nTo find the table name corresponding to a table ID:\nQuery the table\ntables\nin\nsystem.information_schema\nMatch the table ID from the Spark logs with the\nstorage_sub_directory\ncolumn.\n%sql\r\nselect table_catalog, table_schema, table_name \r\nfrom system.information_schema.tables where contains(storage_sub_directory, '')" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-find-the-last-accessed-details-for-a-table.json 
b/scraped_kb_articles/how-to-find-the-last-accessed-details-for-a-table.json new file mode 100644 index 0000000000000000000000000000000000000000..13af0afbc8a10f79c419dbbdc6a0de2a78147e7b --- /dev/null +++ b/scraped_kb_articles/how-to-find-the-last-accessed-details-for-a-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-find-the-last-accessed-details-for-a-table", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to find a table’s last accessed details for audit purposes.\nInstructions\nRun the following query. You can retrieve details about rows, but not columns.\nSELECT\r\n  max(event_time) as event_time,\r\n  request_params.table_full_name\r\n\r\nFROM system.access.audit\r\nWHERE \r\n  request_params.operation = 'READ'\r\nGROUP BY request_params.table_full_name\r\nORDER BY event_time DESC" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-gather-total-cluster-duration.json b/scraped_kb_articles/how-to-gather-total-cluster-duration.json new file mode 100644 index 0000000000000000000000000000000000000000..768a18999509adbcd00067b3f2efb8f9e0c4ff17 --- /dev/null +++ b/scraped_kb_articles/how-to-gather-total-cluster-duration.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-gather-total-cluster-duration", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to gather the total compute duration of a given cluster in your Databricks environment but don’t see a way to do it yourself.\nDatabricks provides system tables which you can query for various monitoring and reporting purposes. For more general information on system tables, refer to the\nMonitor account activity with system tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nInstructions\nTo gather the total compute duration of a cluster, run the following SQL query in your SQL Analytics workspace. 
This query calculates the total cluster duration by summing the difference between the\nusage_start_time\nand\nusage_end_time\nfor each usage record associated with the specified cluster ID.\nReplace\n\nwith the actual cluster ID for which you want to gather the total cluster duration.\nSELECT \r\n    usage_metadata.,\r\n    MIN(usage_start_time) AS creation_time,\r\n    MAX(usage_end_time) AS last_usage_time,\r\n    SUM(TIMESTAMPDIFF(SECOND, usage_start_time, usage_end_time)) /3600 AS total_usage_hours\r\nFROM \r\n    system.billing.usage\r\nWHERE \r\n    usage_metadata.cluster_id = '' --optionally filter by clusterId\r\nGROUP BY \r\n    usage_metadata.cluster_id;" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-get-details-for-an-sql-query-run-in-a-notebook-on-an-all-purpose-cluster.json b/scraped_kb_articles/how-to-get-details-for-an-sql-query-run-in-a-notebook-on-an-all-purpose-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..1f6036a9fe87a409ff08ae62b963f34e7347cd88 --- /dev/null +++ b/scraped_kb_articles/how-to-get-details-for-an-sql-query-run-in-a-notebook-on-an-all-purpose-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-get-details-for-an-sql-query-run-in-a-notebook-on-an-all-purpose-cluster", + "title": "Unknown Article Title", + "content": "Introduction\nYou need to find details of an SQL query run from a notebook on an all-purpose cluster but only see records for queries run using SQL warehouses or serverless compute in the query history table.\nSQL queries executed from a notebook using an all-purpose cluster run as Apache Spark jobs. Both SQL commands and commands executed in notebooks are processed using Spark behind the scenes. You can use verbose audit logs instead to get the query run details.\nInstructions\nEnable verbose audit logs. 
For instructions, refer to the\nEnable verbose audit logs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCheck the\nsystem.access.audit\ntable for your query details. You can use the following example query to fetch the SQL query record.\nselect * from system.access.audit\r\nwhere service_name = \"notebook\"\r\nand request_params.notebookId = \"\"\r\nand event_date = \"\"" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-get-the-full-size-of-a-delta-table-or-partition.json b/scraped_kb_articles/how-to-get-the-full-size-of-a-delta-table-or-partition.json new file mode 100644 index 0000000000000000000000000000000000000000..a51813d6382687c1076a4c929c6b172da7701b50 --- /dev/null +++ b/scraped_kb_articles/how-to-get-the-full-size-of-a-delta-table-or-partition.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/how-to-get-the-full-size-of-a-delta-table-or-partition", + "title": "Unknown Article Title", + "content": "Problem\nYou want to get the full size of a Delta table or partition, rather than the current snapshot.\nNote\nFor instructions on getting the size of a table’s current snapshot, refer to the KB article\nFind the size of a table snapshot\n.\nCause\nYou want to understand the total size of storage data to better evaluate cost management, data lifecycle management, and table optimization choices.\nSolution\nUse the following Scala command. To get the size of the table, pass the root path. 
To get the size of a specific partition, specify the path to the partition.\n%scala\r\ndef findSizes(pathToTable: String): Double = {\r\n // Function to recursively get the size of all files in a directory\r\n def getAllFiles(path: String): Seq[com.databricks.backend.daemon.dbutils.FileInfo] = {\r\n   val filesAndDirs = dbutils.fs.ls(path)\r\n   filesAndDirs.flatMap { fileInfo =>\r\n     if (fileInfo.isDir) {\r\n       getAllFiles(fileInfo.path) // Recurse into subdirectories\r\n     } else {\r\n       Seq(fileInfo) // Collect the file\r\n     }\r\n   }\r\n }\r\n\r\n // Recursively collect all files from the given directory\r\n val allFiles = getAllFiles(pathToTable)\r\n // Sum the sizes of all the files and convert to MB\r\n val totalSize = allFiles.map(_.size).sum\r\n val sizeInMB = totalSize / (1024.0 * 1024.0) // Convert bytes to MB\r\n println(f\"Size of the table is $sizeInMB%.2f MB\") //The size in MB\r\n sizeInMB // Return the size in MB\r\n}\r\n\r\nprint(findSizes(\"dbfs:/\"))// Pass the root path to get size of entire table\r\nprint(findSizes(\"dbfs:/\"))// Pass the path to the partition to get the size of specific partition." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-implement-strict-access-restrictions-on-system-tables.json b/scraped_kb_articles/how-to-implement-strict-access-restrictions-on-system-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..fdec1d0e27a1da8b52e3d3750747b7a1523a0f77 --- /dev/null +++ b/scraped_kb_articles/how-to-implement-strict-access-restrictions-on-system-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-implement-strict-access-restrictions-on-system-tables", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to implement strict access restrictions on system tables to limit user visibility exclusively to data associated with their designated workspace.\nBy default, system tables contain comprehensive data from all workspaces linked to a metastore. As a result, you cannot put restrictions directly on system tables.\nInstructions\nInstead, create a view on top of the system table using the following example.\nCREATE VIEW AS\r\nSELECT *\r\nFROM system.access.audit\r\nWHERE workspace_id = '';\nThen, grant privileges on the view to users who require access to this data. Note the owner of the view must have access to the underlying system table (for example,\nsystem.access.audit\n).\nFor details, review the “Requirements for querying views” section of the\nWhat is a view?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-install-python-packages-from-a-private-pypi-repository-on-databricks.json b/scraped_kb_articles/how-to-install-python-packages-from-a-private-pypi-repository-on-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..1c556219a27cbd3a68b28e12213de44987009778 --- /dev/null +++ b/scraped_kb_articles/how-to-install-python-packages-from-a-private-pypi-repository-on-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-install-python-packages-from-a-private-pypi-repository-on-databricks", + "title": "Unknown Article Title", + "content": "Introduction\nYou need to install Python libraries from a private PyPI repository that requires authentication.\nInstructions\nImportant\nBefore you begin, Databricks recommends never hardcoding credentials. Instead, use the secrets utility to manage credentials securely. For more information, refer to the\nSecret management\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nTo install a package from a private PyPI repository using a notebook command, either configure a cluster-wide index URL or install the package using a notebook cell.\nMethod 1: Configure a cluster-wide index URL\nIf you want persistent, cluster-wide configuration (so all\npip install\ncommands always use your private repo by default), you can configure pip itself. 
This method ensures all subsequent pip install calls on the cluster will automatically use your private index URL without specifying\n-i\neach time.\nYou can do this by either using\npip config\ndirectly, or creating an init script using\npip config\n.\nOption A: Use the pip config command directly\nSet these globally using pip config in a terminal.\npip config --global set global.index-url https://:@your-private-repo-url\r\npip config --global set global.extra-index-url \r\npip config --global set global.cert /etc/pip-certificates/cert.pem\nOption B: Create an init script using pip config\nThe following code provides an example of creating the init script using\npip config\nto persist the settings in your pip config file (for example,\n~/.config/pip/pip.conf\nor an equivalent). This approach is useful if you don’t want to modify\n/etc/pip.conf\n.\n#!/bin/bash\r\n# Make sure your CA cert is on the node\r\nmkdir -p /etc/pip-certificates\r\ncat <<EOF >/etc/pip-certificates/cert.pem\r\n-----BEGIN CERTIFICATE-----\r\n\r\n-----END CERTIFICATE-----\r\nEOF\r\n# Configure pip globally (using INIT script)\r\n# Optionally add additional configs as needed\r\npip config --global set global.index-url https://:@your-private-repo-url\r\npip config --global set global.extra-index-url \r\npip config --global set global.cert /etc/pip-certificates/cert.pem\nMethod 2: Install the package using a notebook cell\nTo install the package using a notebook cell, you can either run code in a notebook or use an init script.\nOption A: Run code in a notebook\nBefore you run the code, replace the following variables with your own values.\n\nand\n\nwith your Databricks secret scope and key.\n\nin the format,\nartifactory1.example.com\n.\n\nwith the name of the package you want to install.\n\nin the required format. 
It may be your email, or it can be a string.\n%python\r\nuser = \"\" \r\npwd = dbutils.secrets.get(\"\", \"\")\r\nrepo_url = f\"https://{user}:{pwd}@\"\r\n\r\n# Install the package using pip magic\r\n%pip install -i {repo_url} ==\nIf SSL issues occur, such as certificate or hostname trust issues, add the --trusted-host argument.\n%sh\r\n%pip install -i {repo_url} --trusted-host \nOption B: Use an init script\nTo use an init script, review the\nInstall a private PyPI repo\nKB article.\nTroubleshooting\nIf the installation fails or behaves unexpectedly, use the following verbose logging option\n-vvv\nto get detailed output.\n%sh\r\n%pip install -i {repo_url} -vvv\nThis command prints detailed diagnostic information to help identify authentication errors, network issues, or pip incompatibilities.\nCommon Issues\nSymptom\nPossible cause\nSuggested fix\n401 Unauthorized\nInvalid credentials or incorrect secret scope/key\nVerify the secret scope and key name, then test with manually copied credentials.\nTimeout / SSL error\nRepository certificate not trusted\nAdd\n--trusted-host \nto the pip command.\nPackage not found\nIncorrect package name or repository URL\nCheck the private repo URL,  then validate the package name and version." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json b/scraped_kb_articles/how-to-list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json new file mode 100644 index 0000000000000000000000000000000000000000..deb577e0cc5f68f5a211b78a4dc050031b18e43a --- /dev/null +++ b/scraped_kb_articles/how-to-list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/how-to-list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to track the usage history (\nSELECT\nqueries) for specific tables, but encounter difficulties in identifying which users have executed\nSELECT\ncommands against arbitrary schemas or tables.\nInstructions\nUse audit logs to identify users who have performed\ngetTable\nactions against the specified tables. 
The audit logs are stored in the\nsystem.access.audit\ntable, which you can query using Apache Spark SQL.\nAdd your desired date range to the query\nevent_date\nparameter and include the full path to the schemas you wish to audit:\ncatalog.schema\n.\nSELECT \r\n    event_time,\r\n    event_date,\r\n    user_identity.email AS user_email,\r\n    service_name,\r\n    action_name,\r\n    request_params,\r\n    source_ip_address,\r\n    user_agent\r\nFROM system.access.audit\r\nWHERE action_name = 'getTable'\r\nAND (\r\n    request_params['full_name_arg'] LIKE '%%'\r\n    OR request_params['full_name_arg'] LIKE '%%'\r\n    OR request_params['full_name_arg'] LIKE '%%'\r\n)\r\nAND event_date BETWEEN '2025-02-01' AND '2025-03-04'\r\nORDER BY event_time DESC;" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-make-service-principals-and-groups-workspace-admins.json b/scraped_kb_articles/how-to-make-service-principals-and-groups-workspace-admins.json new file mode 100644 index 0000000000000000000000000000000000000000..444064988597c64c7a221c9ab8d639fda4fdbeae --- /dev/null +++ b/scraped_kb_articles/how-to-make-service-principals-and-groups-workspace-admins.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/how-to-make-service-principals-and-groups-workspace-admins", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-optimize-sql-commands-on-the-join-clause.json b/scraped_kb_articles/how-to-optimize-sql-commands-on-the-join-clause.json new file mode 100644 index 0000000000000000000000000000000000000000..35a94fd679854b2eb71045244d940f9d3e7e8a7c --- /dev/null +++ b/scraped_kb_articles/how-to-optimize-sql-commands-on-the-join-clause.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/how-to-optimize-sql-commands-on-the-join-clause", + "title": "Unknown Article Title", + "content": "Introduction\nYou want to optimize the 
following SQL command’s performance.\nSELECT * FROM \r\nINNER JOIN ON . = . OR . = .\nAdditional context\nJOIN ON\nuses the\nPhotonBroadcastNestedLoopJoin\n. The nested loop approach in\nPhotonBroadcastNestedLoopJoin\ncan be slow for large datasets because it involves a Cartesian product-like operation (pairing every row from one dataset with every row from the other, creating all possible row-to-row combinations).\nInstructions\nChange your SQL code to use UNION instead.\nSELECT * FROM \r\nINNER JOIN ON . = .\r\nUNION\r\nSELECT * FROM \r\nINNER JOIN ON . = .\nUNION\nuses the\nPhotonBroadcastHashJoin\nto perform parallel joins, which are more efficient for joining large datasets.\nPhotonBroadcastHashJoin\nuses a hash join algorithm where one side of the join is broadcast to all nodes, and a hash table is built on the broadcast side. The other side of the join then probes this hash table to find matching rows." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-pass-parameters-from-workflow-to-a-sql-notebook.json b/scraped_kb_articles/how-to-pass-parameters-from-workflow-to-a-sql-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..75ce1acf1bfa9ec83e985f4459d544ceabd691b6 --- /dev/null +++ b/scraped_kb_articles/how-to-pass-parameters-from-workflow-to-a-sql-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/how-to-pass-parameters-from-workflow-to-a-sql-notebook", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou want to leverage Databricks syntax and configurations to ensure smooth parameter handling in your jobs even when using serverless SQL endpoints.\nInstructions\nCreate a SQL notebook and set up parameters, execute the query, then configure a job with matching parameter keys and corresponding values. 
When the values are triggered through the job, they are automatically passed to the notebook, enabling dynamic and parameter-driven SQL execution aligned with workflow settings.\nCreate a SQL notebook and define parameters\nCreate your notebook and set up the parameters in the UI. Parameter setup is like creating a variable. You define the name and the value type only.\n1. Navigate to your workspace.\n2. Click the\nCreate\nbutton in the top right corner and select\nNotebook\n.\n3. Within the notebook, click\nEdit\nin the top horizontal navigation, then choose\nAdd parameter…\nfrom the dropdown menu.\n4. Repeat step 3 for each parameter you want to add.\n5. Clicking\nAdd parameter…\ncreates a field titled\nparam_1\n. Click the cog above the field on the right side to expand the parameter details and specify them.\nParameter name\n: the name you use to reference the widget in your code.\nWidget label\n: (Optional) UI label displayed above the widget for user clarity.\nWidget type\n: Specifies the widget input type. The choices are Text, Dropdown, Combobox, or Multiselect.\nParameter type\n: Declares the expected data type for the widget value, such as String or Int.\nThis only applies to resources attached to SQL warehouses.\nDefault parameter value\n: (Optional) Default value for the parameter.\n6. Click the cog again to close the parameter.\nExecute your query with the defined parameter in a notebook\nIf you work with Databricks Runtimes 15.2 and above, use the following query structure to invoke the parameter you named in the previous step. Replace\n\nwith your parameter name.\nThe\n`IDENTIFIER()`\nclause is required to parse strings as object identifiers such as names for databases, tables, views, functions, columns, and fields.\nSELECT * FROM IDENTIFIER(:);\nIf you work with Databricks Runtimes 11.3 LTS - 14.3 LTS, use the following syntax in your notebook. 
Replace\n\nwith your parameter name.\nSELECT * FROM ${};\nSet up your job\nAfter you execute your query in a notebook, create a job for the notebook and provide the input parameters you set up previously. The values created in the job, which can vary from job to job, are passed in to each input parameter.\n1. Navigate to\nJobs & pipelines\n(formerly\nWorkflows\n) in the sidebar.\n2. Within\nJobs & pipelines\n, click the\nCreate\nbutton in the top right corner.\n3. Choose\nJob\nfrom the dropdown menu.\n4. Within the job creation UI:\nTask name\n: Choose your own task name.\nType\n: Select Notebook.\nSource\n: Select Workspace.\nPath\n: Specify the path of the notebook you created previously.\nCompute\n: Select Serverless.\nParameters\n: Provide the input parameters according to the parameter names added to the notebook previously.\n5. Click\nSave task\nto finish creating the job." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-prevent-a-job-hanging-at-an-external-api-call-step.json b/scraped_kb_articles/how-to-prevent-a-job-hanging-at-an-external-api-call-step.json new file mode 100644 index 0000000000000000000000000000000000000000..73dfb9f50ede6ed2183f6a78d5e13766c8ae168c --- /dev/null +++ b/scraped_kb_articles/how-to-prevent-a-job-hanging-at-an-external-api-call-step.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/how-to-prevent-a-job-hanging-at-an-external-api-call-step", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYour job is taking longer than expected to complete. When you review the code, the job appears to be stuck during an external API call step.\nInstructions\nThe API endpoint may be down, experiencing network issues, or affected by other external factors impacting responsiveness. 
In your code, implement an API call timeout threshold and wrap API calls in try/except blocks for robust error handling.\nThe following code demonstrates how to use a timeout and a try/except block to cover\nrequests.Timeout\nand other potential request exceptions.\ntry:\r\n  response = requests.put(url, json=data, timeout=60)\r\n  response.raise_for_status() # Raise an error for non-2xx responses\r\nexcept requests.Timeout:\r\n  print(\"Request timed out\")\r\nexcept requests.RequestException as e:\r\n  print(f\"Request failed: {e}\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-programmatically-identify-account-admins-using-the-databricks-rest-api.json b/scraped_kb_articles/how-to-programmatically-identify-account-admins-using-the-databricks-rest-api.json new file mode 100644 index 0000000000000000000000000000000000000000..7be18063fc059480d544d455b9f2a97fa4813687 --- /dev/null +++ b/scraped_kb_articles/how-to-programmatically-identify-account-admins-using-the-databricks-rest-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/how-to-programmatically-identify-account-admins-using-the-databricks-rest-api", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-restrict-cluster-creation-to-single-node-only.json b/scraped_kb_articles/how-to-restrict-cluster-creation-to-single-node-only.json new file mode 100644 index 0000000000000000000000000000000000000000..4182f0712ba57577437e077ac409b95ede82852b --- /dev/null +++ b/scraped_kb_articles/how-to-restrict-cluster-creation-to-single-node-only.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-restrict-cluster-creation-to-single-node-only", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to restrict other users to creating only single-node clusters, in order to conduct testing or in situations where multi-node clusters are not 
needed.\nWhile Databricks allows users to configure cluster settings through the UI, there is no option to enforce single-node clusters for all users.\nCause\nThe UI allows setting up single-node clusters manually, but it does not prevent users from modifying settings to create multi-node clusters.\nSolution\nWhen you create a compute policy and define cluster actions, add the following JSON configuration to enforce single-node cluster creation by default, and help restrict the creation of unnecessary multi-node clusters.\n{\r\n  \"spark_conf.spark.databricks.cluster.profile\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"singleNode\"\r\n  },\r\n  \"spark_conf.spark.master\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"local[*]\"\r\n  },\r\n  \"num_workers\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": 0\r\n  },\r\n  \"custom_tags.ResourceClass\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"SingleNode\"\r\n  }\r\n}\nFor more information, refer to the\nCreate and manage compute policies\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nTo add more attributes or customize the policy further, refer to the\nCompute policy reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
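Because every attribute in the policy fragment above is declared `"fixed"`, users cannot override any of them in the compute-creation UI. A minimal sketch of a local sanity check in Python (the `enforces_single_node` helper is illustrative, not a Databricks API; the policy JSON is the fragment from the article):

```python
import json

# Policy fragment from the article: every attribute is "fixed", so users
# cannot change it when creating compute governed by this policy.
policy = json.loads("""
{
  "spark_conf.spark.databricks.cluster.profile": {"type": "fixed", "value": "singleNode"},
  "spark_conf.spark.master": {"type": "fixed", "value": "local[*]"},
  "num_workers": {"type": "fixed", "value": 0},
  "custom_tags.ResourceClass": {"type": "fixed", "value": "SingleNode"}
}
""")

def enforces_single_node(p: dict) -> bool:
    # All four attributes must be pinned ("fixed") to the single-node values.
    checks = {
        "spark_conf.spark.databricks.cluster.profile": "singleNode",
        "spark_conf.spark.master": "local[*]",
        "num_workers": 0,
        "custom_tags.ResourceClass": "SingleNode",
    }
    return all(
        p.get(k, {}).get("type") == "fixed" and p.get(k, {}).get("value") == v
        for k, v in checks.items()
    )

print(enforces_single_node(policy))  # → True
```

A check like this can run in CI before policy definitions are uploaded, so a stray edit that relaxes one of the fixed attributes is caught early.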
+} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-restrict-selection-of-specific-databricks-runtimes-in-the-compute-creation-ui.json b/scraped_kb_articles/how-to-restrict-selection-of-specific-databricks-runtimes-in-the-compute-creation-ui.json new file mode 100644 index 0000000000000000000000000000000000000000..52603b10a31acb62c24ecac8e7a6e25de5a47b08 --- /dev/null +++ b/scraped_kb_articles/how-to-restrict-selection-of-specific-databricks-runtimes-in-the-compute-creation-ui.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/how-to-restrict-selection-of-specific-databricks-runtimes-in-the-compute-creation-ui", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn the compute configuration UI, runtimes for the\nML\nand\nStandard\ncategories are displayed by default, but you want to restrict users to only configurations from a predefined category (for example, just\nStandard\n).\nCause\nIn the user interface, multiple categories are displayed by default when you create a cluster, each with several choices. 
You want to avoid accidentally selecting configurations that don't align with an organization's requirements.\nSolution\nTo ensure only valid runtime selections are available to select, use the appropriate regex pattern based on the runtime category or configuration type.\nFor Standard Runtimes\nTo limit the selection to standard runtime versions, define the following regex pattern in the compute policy.\n{\r\n  \"spark_version\": {\r\n    \"type\": \"regex\",\r\n    \"pattern\": \"(([1-9])|([0-9][0-9])).[0-9].x-scala[0-9].[0-9][0-9]\"\r\n  }\r\n}\nFor ML Runtimes\nTo limit the selection to ML runtime versions, define the following regex pattern in the compute policy.\n{\r\n  \"spark_version\": {\r\n    \"type\": \"regex\",\r\n    \"pattern\": \"(([1-9])|([0-9][0-9])).[0-9].x-cpu-ml-scala[0-9].[0-9][0-9]\"\r\n  }\r\n}\nFor more details on defining compute policies, refer to the\nCreate and manage compute policies\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-retrieve-dlt-pipeline-details-using-python-and-the-databricks-api.json b/scraped_kb_articles/how-to-retrieve-dlt-pipeline-details-using-python-and-the-databricks-api.json new file mode 100644 index 0000000000000000000000000000000000000000..c05645fe890512c11ff9d06f01615a99da89a87f --- /dev/null +++ b/scraped_kb_articles/how-to-retrieve-dlt-pipeline-details-using-python-and-the-databricks-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/how-to-retrieve-dlt-pipeline-details-using-python-and-the-databricks-api", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nWhen working with DLT pipelines, you want to retrieve DLT pipeline details, including the following.\nSource code (notebooks or file paths)\nConfiguration parameters\nPermissions associated with the pipeline\nYou want to retrieve pipeline details to be able to do the following.\nAudit pipeline configurations.\nExport DLT definitions to 
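Before saving either policy, you can confirm locally that the pattern admits only the intended runtime category. A quick check with Python's `re` module (the sample `spark_version` strings are illustrative; note that the dots in the patterns are unescaped, so as written they act as single-character wildcards, which is permissive but still distinguishes the two categories):

```python
import re

# The two patterns from the article.
standard = r"(([1-9])|([0-9][0-9])).[0-9].x-scala[0-9].[0-9][0-9]"
ml = r"(([1-9])|([0-9][0-9])).[0-9].x-cpu-ml-scala[0-9].[0-9][0-9]"

# Illustrative spark_version strings in the two runtime formats.
std_version = "15.4.x-scala2.12"
ml_version = "15.4.x-cpu-ml-scala2.12"

print(bool(re.fullmatch(standard, std_version)))  # → True
print(bool(re.fullmatch(standard, ml_version)))   # → False
print(bool(re.fullmatch(ml, ml_version)))         # → True
```

The literal `x-scala` versus `x-cpu-ml-scala` segment is what keeps the categories disjoint, so a Standard-only policy rejects ML runtime strings and vice versa.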
CI/CD systems.\nReview access controls for governance.\nInstructions\nAdapt and run the following Python code in a notebook to retrieve DLT pipeline details.\nimport requests\r\nimport logging\r\n\r\n# === Configure logging ===\r\nlogging.basicConfig(\r\n    level=logging.INFO,\r\n    format=\"%(asctime)s [%(levelname)s] %(message)s\"\r\n)\r\n\r\n# === Set your environment variables ===\r\nDATABRICKS_INSTANCE = \"https://.cloud.databricks.com\"\r\nDATABRICKS_TOKEN = \"dapi\"\r\nPIPELINE_ID = \"\"\r\n\r\nheaders = {\r\n    \"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\"\r\n}\r\n\r\n# === Get pipeline info ===\r\npipeline_url = f\"{DATABRICKS_INSTANCE}/api/2.0/pipelines/{PIPELINE_ID}\"\r\nresponse = requests.get(pipeline_url, headers=headers)\r\nresponse.raise_for_status()\r\npipeline_info = response.json()\r\n\r\nlogging.info(\"Pipeline Name: %s\", pipeline_info.get(\"name\"))\r\n\r\nlogging.info(\"Source Code Libraries:\")\r\nfor lib in pipeline_info[\"spec\"].get(\"libraries\", []):\r\n    if \"notebook\" in lib:\r\n        logging.info(\"  - Notebook: %s\", lib[\"notebook\"][\"path\"])\r\n    elif \"file\" in lib:\r\n        logging.info(\"  - File: %s\", lib[\"file\"][\"path\"])\r\n\r\nlogging.info(\"Configuration Parameters:\")\r\nfor k, v in pipeline_info[\"spec\"].get(\"configuration\", {}).items():\r\n    logging.info(\"  - %s: %s\", k, v)\r\n\r\n# === Get pipeline permissions ===\r\npermissions_url = f\"{DATABRICKS_INSTANCE}/api/2.0/permissions/pipelines/{PIPELINE_ID}\"\r\nperm_response = requests.get(permissions_url, headers=headers)\r\nperm_response.raise_for_status()\r\npermissions = perm_response.json()\r\n\r\nlogging.info(\"Pipeline Permissions:\")\r\nfor acl in permissions.get(\"access_control_list\", []):\r\n    principal = (\r\n        acl.get(\"user_name\") or\r\n        acl.get(\"group_name\") or\r\n        acl.get(\"service_principal_name\")\r\n    )\r\n    perms = [p[\"permission_level\"] for p in acl.get(\"all_permissions\", [])]\r\n    
logging.info(\"  - %s: %s\", principal, \", \".join(perms))" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-retrieve-parquet-file-metadata.json b/scraped_kb_articles/how-to-retrieve-parquet-file-metadata.json new file mode 100644 index 0000000000000000000000000000000000000000..e596818e3820d2faeea0fb4cc5c60426c48a4e75 --- /dev/null +++ b/scraped_kb_articles/how-to-retrieve-parquet-file-metadata.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/how-to-retrieve-parquet-file-metadata", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou want to collect metadata from your Parquet files such as total rows, number of row groups, and per-row-group details like row count and size. This metadata helps you debug, validate, and build robust, scalable data pipelines.\nInstructions\nUse the following code to extract the needed information from files stored on Databricks File System (DBFS) or cloud storage with a mount path.\nPass the path of the file for which you want to extract the total row groups, total rows, and details for each row group.\nimport pyarrow.parquet as pq\r\n\r\n# Replace with the actual parquet file path.\r\npath = \"\"\r\nparquet_file = pq.ParquetFile(path)\r\nmetadata = parquet_file.metadata\r\n\r\nprint(f\"Total Row Groups: {parquet_file.num_row_groups}\")\r\nprint(f\"Total Rows: {metadata.num_rows}\")\r\n\r\nfor i in range(parquet_file.num_row_groups):\r\n    row_group = metadata.row_group(i)\r\n    size_bytes = row_group.total_byte_size  # Get size in bytes\r\n\r\n    # Convert bytes to KB, MB, GB\r\n    size_kb = size_bytes / 1024\r\n    size_mb = size_kb / 1024\r\n    size_gb = size_mb / 1024\r\n\r\n    print(f\"Row Group {i}:\")\r\n    print(\r\n        f\" - Size: {size_bytes} bytes ({size_kb:.2f} KB, {size_mb:.2f} MB, {size_gb:.4f} GB)\"\r\n    )\r\n    print(f\" - Rows: {row_group.num_rows}\")\nFor information on the Parquet file format, refer to the\nRead Parquet files using 
Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor information on what Parquet files are, refer to\nParquet\nin the Databricks glossary." +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-specify-dbfs-path.json b/scraped_kb_articles/how-to-specify-dbfs-path.json new file mode 100644 index 0000000000000000000000000000000000000000..381ebb66b40632013b549cb6f266597eb71b98e2 --- /dev/null +++ b/scraped_kb_articles/how-to-specify-dbfs-path.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/how-to-specify-dbfs-path", + "title": "Título do Artigo Desconhecido", + "content": "When working with Databricks, you will sometimes have to access the Databricks File System (DBFS).\nAccessing files on DBFS is done with standard filesystem commands; however, the syntax varies depending on the language or tool used.\nFor example, take the following DBFS path:\ndbfs:/mnt/test_folder/test_folder1/\nApache Spark\nUnder Spark, you should specify the full path inside the Spark read command.\nspark.read.parquet(\"dbfs:/mnt/test_folder/test_folder1/file.parquet\")\nDBUtils\nWhen you are using DBUtils, the full DBFS path should be used, just like it is in Spark commands. The language-specific formatting around the DBFS path differs depending on the language used.\nBash\n%fs\r\nls dbfs:/mnt/test_folder/test_folder1/\nPython\n%python\r\n\r\ndbutils.fs.ls('dbfs:/mnt/test_folder/test_folder1/')\nScala\n%scala\r\n\r\ndbutils.fs.ls(\"dbfs:/mnt/test_folder/test_folder1/\")\nNote\nSpecifying\ndbfs:\nis not required when using DBUtils or Spark commands. The path\ndbfs:/mnt/test_folder/test_folder1/\nis equivalent to\n/mnt/test_folder/test_folder1/\n.\nShell commands\nShell commands do not recognize the DBFS path. 
Instead, DBFS and the files within it are accessed with the same syntax as any other folder on the file system.\nBash\nls /dbfs/mnt/test_folder/test_folder1/\r\ncat /dbfs/mnt/test_folder/test_folder1/file_name.txt\nPython\nimport os\r\nos.listdir('/dbfs/mnt/test_folder/test_folder1/')\nScala\nimport java.io.File\r\nval directory = new File(\"/dbfs/mnt/test_folder/test_folder1/\")\r\ndirectory.listFiles" +} \ No newline at end of file diff --git a/scraped_kb_articles/how-to-use-custom-image-in-a-dlt-pipeline.json b/scraped_kb_articles/how-to-use-custom-image-in-a-dlt-pipeline.json new file mode 100644 index 0000000000000000000000000000000000000000..f501f96f9847c8c51594046d1057bbda38100076 --- /dev/null +++ b/scraped_kb_articles/how-to-use-custom-image-in-a-dlt-pipeline.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/how-to-use-custom-image-in-a-dlt-pipeline", + "title": "Título do Artigo Desconhecido", + "content": "Introduction\nYou want to use a custom image in a DLT pipeline.\nInstructions\n1. From your DLT pipeline UI, click\nSettings\n.\n2. Click the\nJSON\nbutton in the pipeline settings view.\n3. Add a new tag using\n\"\"\nwith the custom image name (prefixed with\ncustom:\n). The following code provides an example.\n\"dbr_version\": \"custom:15.4.16-delta-pipelines-dlt-release-dp-2025.23-rc0-commit-XXXXXXX-image-XXXXXXX\",\n\"dbr_version\"\nshould be a JSON element at the same level as\n\"channel\"\n, as demonstrated in the following screenshot.\n4. Click\nSave\n.\n5. Validate the\n\"Databricks Runtime Version\"\nin the cluster details of the running pipeline." 
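The mapping between the two syntaxes is mechanical: drop the `dbfs:` scheme for Spark and DBUtils paths, or prepend the `/dbfs` FUSE mount for shell-style access. A small illustrative helper (the function name is ours, not part of any Databricks library):

```python
def to_fuse_path(dbfs_path: str) -> str:
    """Convert a dbfs:/... or /... DBFS path to its /dbfs/... shell equivalent."""
    if dbfs_path.startswith("dbfs:/"):
        dbfs_path = dbfs_path[len("dbfs:"):]   # dbfs:/mnt/x -> /mnt/x
    if not dbfs_path.startswith("/dbfs/"):
        dbfs_path = "/dbfs" + dbfs_path        # /mnt/x -> /dbfs/mnt/x
    return dbfs_path

print(to_fuse_path("dbfs:/mnt/test_folder/test_folder1/"))  # → /dbfs/mnt/test_folder/test_folder1/
print(to_fuse_path("/mnt/test_folder/test_folder1/"))       # → /dbfs/mnt/test_folder/test_folder1/
```

A helper like this is handy when the same path must be handed to both `dbutils.fs` commands and plain `os`/shell file operations within one notebook.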
+} \ No newline at end of file diff --git a/scraped_kb_articles/howto-jobsdeleterestapi.json b/scraped_kb_articles/howto-jobsdeleterestapi.json new file mode 100644 index 0000000000000000000000000000000000000000..f8db043c9276d8a1f00343d83aaa065649052be4 --- /dev/null +++ b/scraped_kb_articles/howto-jobsdeleterestapi.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/howto-jobsdeleterestapi", + "title": "Título do Artigo Desconhecido", + "content": "Run the following commands to delete all jobs in a Databricks workspace.\nIdentify the jobs to delete and list them in a text file:\n%sh\r\n\r\ncurl -X GET -H \"Authorization: Bearer \" https:///api/2.0/jobs/list | grep -o -P 'job_id.{0,6}' | awk -F':' '{print $2}' >> job_id.txt\nRun the\ncurl\ncommand in a loop to delete the identified jobs:\n%sh\r\n\r\nwhile read line\r\ndo\r\njob_id=$line\r\ncurl -X POST -H \"Authorization: Bearer \" https:///api/2.0/jobs/delete -d '{\"job_id\": '\"$job_id\"'}'\r\ndone < job_id.txt" +} \ No newline at end of file diff --git a/scraped_kb_articles/http-500-error-in-the-apache-spark-ui-sql-dataframe.json b/scraped_kb_articles/http-500-error-in-the-apache-spark-ui-sql-dataframe.json new file mode 100644 index 0000000000000000000000000000000000000000..21f2345db46cde5098673c2db9f9c87dd242c54e --- /dev/null +++ b/scraped_kb_articles/http-500-error-in-the-apache-spark-ui-sql-dataframe.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/http-500-error-in-the-apache-spark-ui-sql-dataframe", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to navigate to your cluster >\nSpark UI\n>\nSQL/DataFrame\n, you notice the physical plan is not logging. 
When you try to access the\nSQL/DataFrame\nsection, you receive the following error.\nHTTP ERROR 500 java.util.NoSuchElementException: 3?__main__?+=0000000000000158?+=0000000000000000\r\nCaused by:\r\njava.util.NoSuchElementException: 3?__main__?+=0000000000000158?+=0000000000000000\r\nat org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:133)\r\nat org.apache.spark.util.kvstore.LevelDB.read(LevelDB.java:147)\r\nat org.apache.spark.util.kvstore.DatabricksHybridStore.read(DatabricksHybridStore.scala:37)\r\nat org.apache.spark.sql.execution.ui.SQLAppStatusStore.planGraph(SQLAppStatusStore.scala:81)\r\nat org.apache.spark.sql.execution.ui.EdgeExecutionPage.$anonfun$render$14(ExecutionPage.scala:413)\r\nat scala.Option.map(Option.scala:230)\r\nat org.apache.spark.sql.execution.ui.EdgeExecutionPage.render(ExecutionPage.scala:347)\r\nat org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:109)\r\nat org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:101)\nCause\nYour\nsparkPlanInfo\nstructure and its nested children exceed the default size limit of 2MB (2097152 bytes) for a single event in Apache Spark's event logging. This limit is controlled by the\nspark.eventLog.unknownRecord.maxSize\nconfiguration.\nSolution\nIncrease the maximum size for unknown records to 16MB by adding the following Spark config to your cluster.\nspark.eventLog.unknownRecord.maxSize 16m\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nImportant\nIncreasing the maximum size to 16MB increases the likelihood of more memory being consumed. Depending on your workflow, you may require a change to a cluster with more memory." 
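If the cluster is created through the Clusters API or infrastructure-as-code rather than the UI, the same setting goes under the `spark_conf` block of the cluster spec. An illustrative fragment (field values other than the config key are placeholders):

```json
{
  "spark_conf": {
    "spark.eventLog.unknownRecord.maxSize": "16m"
  }
}
```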
+} \ No newline at end of file diff --git a/scraped_kb_articles/http-error-401-when-trying-to-access-the-adls-mount-path-from-databricks.json b/scraped_kb_articles/http-error-401-when-trying-to-access-the-adls-mount-path-from-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..ad34dcf42ea1eec0aceab002018037858d8c381f --- /dev/null +++ b/scraped_kb_articles/http-error-401-when-trying-to-access-the-adls-mount-path-from-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/http-error-401-when-trying-to-access-the-adls-mount-path-from-databricks", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to access mount paths backed by an Azure Data Lake Storage (ADLS) Gen2 storage account, you receive the following error message.\nDatabricksServiceException: IO_ERROR: HTTP Error 401; url='https://login.microsoftonline.com/xxxxxxx-xxxx-xxxx-xxx-xxxxxxxxxxx/oauth2/token' AADToken: HTTP connection to https://login.microsoftonline.com/xxxxxxx-xxxx-xxxx-xxx-xxxxxxxxxxx/oauth2/token failed for getting token from AzureAD.; requestId=''; contentType='application/json; charset=utf-8'; response '{'error':'invalid_client','error_description':'AADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app ''. 
Trace ID: Correlation ID: Timestamp: 2025-06-03 14:45:15Z','error_codes':[7000215],'timestamp':'2025-06-03 14:45:15Z','trace_id':'','correlation_id':'','error_uri':'https://login.microsoftonline.com;\nCause\nThe client secret used for authenticating a service principal (SP) with Azure Storage has expired.\nWhen the secret expires, Azure Active Directory (AD) no longer accepts the authentication request and the mount configuration becomes invalid, resulting in the observed HTTP Error 401.\nSolution\nGenerate a new client secret for the service principal (SP) in Azure Entra ID (formerly Azure Active Directory). Securely save the new secret, as it will be used in subsequent steps.\nIn a Databricks notebook, unmount the storage account mount path that was configured using the SP with the expired secret. Execute\ndbutils.fs.unmount(\"/mnt/\")\nAfter unmounting, run\ndbutils.fs.refreshMounts()\non all other running clusters to ensure that the mount changes are propagated.\nRecreate the mount path using the same SP but with the new client secret. Update your mount configuration with the new secret and execute the mount command. For details refer to the\nMounting cloud object storage on Azure Databricks\ndocumentation.\nOnce remounted, run\ndbutils.fs.refreshMounts()\non all running clusters again to propagate the updated mount configuration, ensuring consistency." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/hyperopt-fail-maxnumconcurrenttasks.json b/scraped_kb_articles/hyperopt-fail-maxnumconcurrenttasks.json new file mode 100644 index 0000000000000000000000000000000000000000..deabd1da0d1f5b36ee118a325540023e7fa2dfb6 --- /dev/null +++ b/scraped_kb_articles/hyperopt-fail-maxnumconcurrenttasks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/hyperopt-fail-maxnumconcurrenttasks", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are tuning machine learning parameters using\nHyperopt\nwhen your job fails with a\npy4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist\nerror.\nYou are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster.\nCause\nDatabricks Runtime ML has a compatible version of Hyperopt pre-installed (\nAWS\n|\nAzure\n|\nGCP\n).\nIf you manually install a second version of Hyperopt, it causes a conflict.\nSolution\nDo not install Hyperopt on Databricks Runtime ML clusters." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/iceberg-metadata-not-reflecting-latest-changes-made-to-table-appearing-out-of-sync.json b/scraped_kb_articles/iceberg-metadata-not-reflecting-latest-changes-made-to-table-appearing-out-of-sync.json new file mode 100644 index 0000000000000000000000000000000000000000..bf4226562c4547d4a57f0fb8ce1ad19f685e6ed2 --- /dev/null +++ b/scraped_kb_articles/iceberg-metadata-not-reflecting-latest-changes-made-to-table-appearing-out-of-sync.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/iceberg-metadata-not-reflecting-latest-changes-made-to-table-appearing-out-of-sync", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using UniForm to share datasets with clients who do not have a Databricks instance, you notice your Iceberg metadata does not reflect the latest changes made to the Delta table.\nYou observe the issue when ingesting data to Iceberg tables using in-house pipelines, where the Iceberg metadata reflects a one-change delay.\nCause\nWhen data is ingested into a Delta table, the changes are immediately reflected in the Delta log. However, the Iceberg metadata update is an asynchronous process that takes a minimum of a few hours after the ingestion job is completed to reflect changes.\nYour job terminates right after pushing the Delta changes, and the async Iceberg metadata update does not have enough time to complete.\nSolution\nAdd the\nMSCK REPAIR TABLE SYNC METADATA\ncommand at the end of the ingestion job to ensure the Iceberg metadata updates synchronously with each ingestion, preventing delays in reflecting the latest changes.\nAlternatively, you can manually run the\nMSCK REPAIR TABLE SYNC METADATA\ncommand after each ingestion.\nFor details, refer to the “Manually trigger Iceberg metadata conversion” section of the\nRead Delta tables with Iceberg clients\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/id-duplicate-on-append.json b/scraped_kb_articles/id-duplicate-on-append.json new file mode 100644 index 0000000000000000000000000000000000000000..a33b79cb3e3e7b0bfb8f22957040275ed9e6feba --- /dev/null +++ b/scraped_kb_articles/id-duplicate-on-append.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/id-duplicate-on-append", + "title": "Título do Artigo Desconhecido", + "content": "A common issue when performing append operations on Delta tables is duplicate data.\nFor example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table.\nIn this article, we review basic troubleshooting steps that you can use to identify duplicate records, as well as the user name, and notebooks or jobs that resulted in the duplicate data.\nIdentify columns with duplicate records\n%sql\r\n\r\nselect count(*) as count, from group by order by \nThe output identifies all columns with duplicate data.\nIdentify input files with duplicate data\nSelect a data point from the previous query and use it to determine which files provided duplicate data.\n%sql\r\n\r\nselect *, input_file_name() as path from where =\nThe output includes a column called\npath\n, which identifies the full path to each input file.\nIdentify the location table\n%sql\r\n\r\ndescribe table extended \nUse the location table results to search for parquet paths\n%sh\r\n\r\ngrep -r 'part-.snappy.parquet' /dbfs/user/hive/warehouse//_delta_log\n%sh\r\n\r\ngrep -r 'part-/_delta_log\nThe results allow you to identify the impacted Delta versions.\nCheck the Delta history for the impacted versions\n%sql\r\n\r\nselect * from (describe history ) t where t.version In(0,1)\nThe Delta history results provide the user name, as well as the notebook or job id that caused the duplicate to appear in the Delta table.\nNow that you have 
identified the source of the duplicate data, you can modify the notebook or job to prevent it from happening.\nExample notebook\nReview the\nIdentify duplicate data on append example\nnotebook." +} \ No newline at end of file diff --git a/scraped_kb_articles/identity-federation-is-not-enabled-in-workspaces-created-with-terraform.json b/scraped_kb_articles/identity-federation-is-not-enabled-in-workspaces-created-with-terraform.json new file mode 100644 index 0000000000000000000000000000000000000000..b29b24eeba6517ffa1aac722a1cb7b65fcf62812 --- /dev/null +++ b/scraped_kb_articles/identity-federation-is-not-enabled-in-workspaces-created-with-terraform.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/terraform/identity-federation-is-not-enabled-in-workspaces-created-with-terraform", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen creating a workspace via Terraform, the\nidentity federation setting\n(\nAWS\n|\nAzure\n|\nGCP\n) is not enabled by default.\nCause\nIn order to enable identity federation, your workspace needs to belong to a metastore. Only new accounts that are created after November 8, 2023, have Unity Catalog and identity federation enabled by default. Accounts created before then have to manually enable the feature.\nSolution\nTo enable identity federation via Terraform when the workspace is created, you must specify the\nmetastore_assignment\nattribute when creating the workspace in order to assign it to the metastore.\nExample code\nresource \"databricks_metastore_assignment\" \"\" {\r\nworkspace_id = \r\nmetastore_id = \r\n}\nYou need to enter the values for your\nworkspace ID\nand your\nmetastore ID\nto the example code. 
You will also need to enter your metastore name.\nIf you are using Terraform\nvariables\nto manage the value, set the attributes to those variables.\nFor example,\nworkspace_id = \n.\nFor more information, review the Terraform\ndatabricks_metastore_assignment\ndocumentation.\nInfo\nYou can also\nassign a workspace to a metastore\n(\nAWS\n|\nAzure\n|\nGCP\n) with an API call." +} \ No newline at end of file diff --git a/scraped_kb_articles/idle-clusters-causing-inefficient-resource-use-and-increased-costs.json b/scraped_kb_articles/idle-clusters-causing-inefficient-resource-use-and-increased-costs.json new file mode 100644 index 0000000000000000000000000000000000000000..6ed78101ee784cd807096845bfd60d356046b9d6 --- /dev/null +++ b/scraped_kb_articles/idle-clusters-causing-inefficient-resource-use-and-increased-costs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/idle-clusters-causing-inefficient-resource-use-and-increased-costs", + "title": "Unknown Article Title", + "content": "Problem\nYou have a streaming job that runs continuously, processing batches of data as they arrive. However, the time interval between batches can vary significantly, ranging from one minute to two hours, leaving the cluster idle for extended periods throughout the day. You experience inefficient resource usage and increased costs.\nCause\nYou have not configured your streaming job to run only when new batches of data are available.\nSolution\nUse the\nTrigger type - File arrival\non your streaming jobs with the\n.trigger(availableNow=True)\nat a streaming level.\nFile arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. 
File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.\nTo use file arrival triggers, you must:\nEnsure your workspace has Unity Catalog enabled\nUse a storage location that’s either a Unity Catalog volume or an external location added to the Unity Catalog metastore\nHave\nREAD\npermissions to the storage location and\nCAN MANAGE\npermissions on the job\nIf your workspace is not onboarded into Unity Catalog, you can still avoid running in continuous mode by scheduling your jobs to run every 15-20 minutes and reading all available files.\nFor more information, please refer to the\nTrigger jobs when new files arrive\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/import-custom-ca-cert.json b/scraped_kb_articles/import-custom-ca-cert.json new file mode 100644 index 0000000000000000000000000000000000000000..535f9de495c71324a53fbff144555d4dd326e88f --- /dev/null +++ b/scraped_kb_articles/import-custom-ca-cert.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/import-custom-ca-cert", + "title": "Unknown Article Title", + "content": "When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints.\nConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 110] Connection timed out',))\nSimilarly, you may need custom certificates to be added to the default Java cacerts in order to access different endpoints with Apache Spark JVMs.\nInstructions\nTo import one or more custom CA certificates to your Databricks compute, you can create an init script that adds the entire CA certificate chain to both the Linux SSL and Java default cert stores, and sets the\nREQUESTS_CA_BUNDLE\nproperty.\nThe resulting init script can be 
configured as a cluster-scoped init script or a global init script.\nFor more information on configuring init scripts, please review the\nWhat are init scripts?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIn this example init script, PEM format CA certificates are added to the file\nmyca.crt\n, which is located at\n/usr/local/share/ca-certificates/\n.\nYou should replace the values\n\nand\n\nin the example code with your custom certificate details. You can add as many custom certificates as you need.\n#!/bin/bash\r\n\r\ncat << 'EOF' > /usr/local/share/ca-certificates/myca.crt\r\n-----BEGIN CERTIFICATE-----\r\n\r\n-----END CERTIFICATE-----\r\n-----BEGIN CERTIFICATE-----\r\n\r\n-----END CERTIFICATE-----\r\nEOF\r\n\r\nupdate-ca-certificates\r\n\r\nPEM_FILE=\"/etc/ssl/certs/myca.pem\"\r\nPASSWORD=\"changeit\"\r\nJAVA_HOME=$(readlink -f /usr/bin/java | sed \"s:bin/java::\")\r\nKEYSTORE=\"$JAVA_HOME/lib/security/cacerts\"\r\n\r\nCERTS=$(grep 'END CERTIFICATE' $PEM_FILE| wc -l)\r\n\r\n# To process multiple certs with keytool, you need to extract\r\n# each one from the PEM file and import it into the Java KeyStore.\r\n\r\nfor N in $(seq 0 $(($CERTS - 1))); do\r\n ALIAS=\"$(basename $PEM_FILE)-$N\"\r\n echo \"Adding to keystore with alias:$ALIAS\"\r\n cat $PEM_FILE |\r\n awk \"n==$N { print }; /END CERTIFICATE/ { n++ }\" |\r\n keytool -noprompt -import -trustcacerts \\\r\n -alias $ALIAS -keystore $KEYSTORE -storepass $PASSWORD\r\ndone\r\n\r\necho \"export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt\" >> /databricks/spark/conf/spark-env.sh\r\necho \"export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt\" >> /databricks/spark/conf/spark-env.sh\nTroubleshooting\nIf you get an error message like\nbash: line : $'\\r': command not found\nor\nbash: line : warning: here-document at line 3 delimited by end-of-file (wanted `EOF')\n, you may have Windows-style new line characters present.\nUse\ncat -v \nto view the file and look for any hidden characters.\nIf you do have 
unwanted new line characters present, you can use a utility like\ndos2unix\nto convert the file to a standard *nix-style format." +} \ No newline at end of file diff --git a/scraped_kb_articles/inconsistent-timestamp-results-jdbc.json b/scraped_kb_articles/inconsistent-timestamp-results-jdbc.json new file mode 100644 index 0000000000000000000000000000000000000000..c0b9a693665305d04de29e2ba0d277d0b8f9012a --- /dev/null +++ b/scraped_kb_articles/inconsistent-timestamp-results-jdbc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/inconsistent-timestamp-results-jdbc", + "title": "Unknown Article Title", + "content": "Problem\nWhen using JDBC applications with Databricks clusters, you see inconsistent\njava.sql.Timestamp\nresults when switching between standard time and daylight saving time.\nCause\nDatabricks clusters use UTC by default.\njava.sql.Timestamp\nuses the JVM’s local time zone.\nIf a Databricks cluster returns\n2021-07-12 21:43:08\nas a string, the JVM parses it as\n2021-07-12 21:43:08\nand assumes the time zone is local.\nThis works normally for most of the year, but when the local time zone has a DST adjustment, it causes an issue as UTC does not change.\nFor example, on March 14, 2021, the US switched from standard time to daylight saving time. This means that local time went from 1:59 am to 3:00 am.\nIf a Databricks cluster returns\n2021-03-14 02:10:55\n, the JVM automatically converts it to\n2021-03-14 03:10:55\nbecause\n02:10:55\ndoes not exist in local time on that date.\nSolution\nOption 1\n: Configure the JVM time zone to UTC.\nSet the\nuser.timezone\nproperty to\nGMT\n.\nReview the\nJava time zone settings\ndocumentation for more information.\nOption 2\n: Use ODBC instead of JDBC. 
ODBC interprets timestamps as UTC.\nInstall the\nDatabricks ODBC Driver\n.\nConnect pyodbc (\nAWS\n|\nAzure\n|\nGCP\n) to Databricks.\nYou can also use\nturbodbc\n.\nOption 3\n: Set the local time zone to UTC in your JDBC application.\nReview the documentation for your JDBC application to learn how to configure the local time zone settings." +} \ No newline at end of file diff --git a/scraped_kb_articles/incorrect-input-record-count-in-apache-spark-streaming-application-logsmicro-batch-metrics.json b/scraped_kb_articles/incorrect-input-record-count-in-apache-spark-streaming-application-logsmicro-batch-metrics.json new file mode 100644 index 0000000000000000000000000000000000000000..cd33ca3a5d882403c43269afac9f1b6005b46f35 --- /dev/null +++ b/scraped_kb_articles/incorrect-input-record-count-in-apache-spark-streaming-application-logsmicro-batch-metrics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/incorrect-input-record-count-in-apache-spark-streaming-application-logsmicro-batch-metrics", + "title": "Unknown Article Title", + "content": "Problem\nWhen observing the logs of Spark Streaming applications, you notice the metric\nnumInputRows\nin the micro batch metrics or number of input records,\nevent.progress.numInputRows\n, logged using the\nStreamingQueryListener\ndoes not match the expected count. This leads to confusion about the actual number of records being processed.\nCause\nA discrepancy in the input record count happens when multiple actions are triggered on the same DataFrame within the\nforeachBatch\nfunction.\nWhen an action is called more than once on the same stream, the\nnumInputRows\nvalue counts the total number of records read for\nall\nactions in the code. If actions like\ndf.count()\nare called multiple times, the same data is counted again, leading to an inflated record count in the logs. 
The inflated value appears in streaming micro batch metrics and\nqueryListeners\n.\nSolution\nOptimize actions on the DataFrame within the\nforeachBatch\nfunction.\nThe recommended approach is to cache the DataFrame before performing any actions and unpersist it afterward to avoid resource leaks.\nCache the DataFrame using\ndf.cache()\nbefore performing any action.\nPerform the necessary actions on the cached DataFrame.\nUnpersist the DataFrame using\ndf.unpersist()\nafter the actions are completed to free up resources.\nFor more information, please refer to the\nApache Spark Class DataFrame\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/incorrect-results-docs-as-input.json b/scraped_kb_articles/incorrect-results-docs-as-input.json new file mode 100644 index 0000000000000000000000000000000000000000..4bbc1e8b559257fe15607222e0f6f15c2f47fc93 --- /dev/null +++ b/scraped_kb_articles/incorrect-results-docs-as-input.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/incorrect-results-docs-as-input", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a ML model that takes documents as inputs, specifically, an array of strings.\nYou use a feature extractor like\nTfidfVectorizer\nto convert the documents to an array of strings and ingest the array into the model.\nThe model is trained, and predictions happen in the notebook, but model serving doesn’t return the expected results for JSON inputs.\nCause\nTfidfVectorizer\nexpects an array of documents as an input.\nDatabricks converts inputs to Pandas DataFrames, which\nTfidfVectorizer\ndoes not process correctly.\nSolution\nYou must create a custom transformer and add it to the head of the pipeline.\nFor example, the following sample code checks the input for DataFrames. If it finds a DataFrame, the first column is converted to an array of documents. 
The array of documents is then passed to\nTfidfVectorizer\nbefore being ingested into the model.\n%python\r\n\r\nclass DataFrameToDocs():\r\n    def transform(self, input_df):\r\n        import pandas as pd\r\n        if isinstance(input_df, pd.DataFrame):\r\n          return input_df[0].values\r\n        else:\r\n          return input_df\r\n\r\n    def fit(self, X, y=None, **fit_params):\r\n        return self\r\n\r\n\r\nsteps = [('dftodocs', DataFrameToDocs()),('tfidf', TfidfVectorizer()), ('nb_clf', MultinomialNB())]\r\npipeline = Pipeline(steps)\nInfo\nWhen input as JSON, both\n[\"Hello\", \"World\"]\nand\n[[\"Hello\"],[\"World\"]]\nreturn the same output." +} \ No newline at end of file diff --git a/scraped_kb_articles/increase-tasks-per-stage.json b/scraped_kb_articles/increase-tasks-per-stage.json new file mode 100644 index 0000000000000000000000000000000000000000..ef48442657049cdd5fc36f6e26456a61a6c76172 --- /dev/null +++ b/scraped_kb_articles/increase-tasks-per-stage.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/increase-tasks-per-stage", + "title": "Unknown Article Title", + "content": "When using the\nspark-xml\npackage, you can increase the number of tasks per stage by changing the configuration setting\nspark.hadoop.mapred.max.split.size\nto a lower value in the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n). This configuration setting controls the input block size. When data is read from DBFS, it is divided into input blocks, which are then sent to different executors. This configuration controls the size of these input blocks. By default, it is 128 MB (128000000 bytes).\nSetting this value in the notebook with\nspark.conf.set()\nis not effective.\nIn the following example, the\nSpark config\nfield shows that the input block size is 32 MB." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/increased-job-execution-time-after-migrating-from-all-purpose-to-job-cluster.json b/scraped_kb_articles/increased-job-execution-time-after-migrating-from-all-purpose-to-job-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..7b258c96735fcb0cfd41cb53d5d072a0b2388d2b --- /dev/null +++ b/scraped_kb_articles/increased-job-execution-time-after-migrating-from-all-purpose-to-job-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/increased-job-execution-time-after-migrating-from-all-purpose-to-job-cluster", + "title": "Unknown Article Title", + "content": "Problem\nYour Apache Spark job, which involves several metastore-related operations such as\nALTER\nor\nMSCK REPAIR\n, runs longer after migrating from all-purpose compute to a job cluster. You also notice a gap between Spark job executions.\nWhen you analyze the thread dump, you find threads stuck at\nHiveClientImpl\n.\nSample thread\nThread-119\" #277 daemon prio=5 os_prio=0 tid=XXXX nid=XXX waiting on condition [XXX]\r\n   java.lang.Thread.State: WAITING (parking)\r\n\tat sun.misc.Unsafe.park(Native Method)\r\n\t- parking to wait for  (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)\r\n\tat java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)\r\n\tat java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)\r\n\tat org.spark_project.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)\r\n\tat org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)\r\n\tat org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)\r\n\tat org.apache.spark.sql.hive.client.LocalHiveClientsPool.super$borrowObject(LocalHiveClientImpl.scala:124)\r\n\tat 
org.apache.spark.sql.hive.client.LocalHiveClientsPool.$anonfun$borrowObject$1(LocalHiveClientImpl.scala:124)\r\n\tat org.apache.spark.sql.hive.client.LocalHiveClientsPool$$Lambda$5190/556896297.apply(Unknown Source)\r\n\tat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:410)\r\n\tat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:396)\r\n\tat com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)\r\n\tat org.apache.spark.sql.hive.client.LocalHiveClientsPool.borrowObject(LocalHiveClientImpl.scala:122)\r\n\tat org.apache.spark.sql.hive.client.PoolingHiveClient.retain(PoolingHiveClient.scala:181)\r\n\tat org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:113)\r\n…\nCause\nThe Hive client pool size in the job cluster is limited to 1, compared to 20 for all-purpose compute. This pool size difference causes a bottleneck in the execution when there are a lot of operations involving the metastore.\nSolution\nIn your job cluster settings, under the\nAdvanced options\naccordion in the\nSpark\ntab, set the following configurations in the\nSpark config\nbox to increase the Hive client pool size to align it with the previous all-purpose compute size.\nspark.databricks.hive.metastore.client.pool.size 20\r\nspark.databricks.clusterSource API" +} \ No newline at end of file diff --git a/scraped_kb_articles/increased-wait-times-between-micro-batches-in-auto-loader.json b/scraped_kb_articles/increased-wait-times-between-micro-batches-in-auto-loader.json new file mode 100644 index 0000000000000000000000000000000000000000..cdd741119f28949dc65bd7dadcbe3c97f7b4456f --- /dev/null +++ b/scraped_kb_articles/increased-wait-times-between-micro-batches-in-auto-loader.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/increased-wait-times-between-micro-batches-in-auto-loader", + "title": "Unknown 
Article Title", + "content": "Problem\nWhen running an Auto Loader job in directory listing mode, you may experience increased wait time between micro-batches.\nCause\nWhen the input file path is a nested directory path, the job takes time to list all the nested directories. Thus, the job has to wait for worker threads to make progress before processing the next batch, leading to increased wait time between micro-batches.\nSolution\nUse file notification mode instead of the directory listing method. Set\ncloudFiles.useNotifications\nto\ntrue\nin the\nreadStream\noptions. This will save time in listing directories and process the files available in the queue.\nFor more information, please review the\nWhat is Auto Loader file notification mode?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you want to use directory listing mode, avoid using a nested directory path as an input. This will help in reducing the time spent in listing all the directories." +} \ No newline at end of file diff --git a/scraped_kb_articles/init-script-fail-download-maven.json b/scraped_kb_articles/init-script-fail-download-maven.json new file mode 100644 index 0000000000000000000000000000000000000000..fdc32820c410a29190f93b7e0e2de97a0277c455 --- /dev/null +++ b/scraped_kb_articles/init-script-fail-download-maven.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/init-script-fail-download-maven", + "title": "Unknown Article Title", + "content": "Problem\nYou have an init script that is attempting to install a library via Maven, but it fails when trying to download a JAR.\nhttps://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.4.1/rapids-4-spark_2.12-0.4.1.jar%0D\r\nResolving repo1.maven.org (repo1.maven.org)... 151.101.248.209\r\nConnecting to repo1.maven.org (repo1.maven.org)|151.101.248.209|:443... connected.\r\nHTTP request sent, awaiting response... 
404 Not Found\r\n2021-07-30 01:31:11 ERROR 404: Not Found.\nCause\nThere is a carriage return (\n%0D\n) character at the end of one or more of the lines in the init script.\nThis is usually caused by editing a file in Windows and then uploading it to your Databricks workspace without removing the excess carriage returns.\nSolution\nRemove the Windows carriage returns by running\ndos2unix\non the file after you have uploaded it to the workspace.\n%sh\r\n\r\nsudo apt-get install dos2unix -y\r\ndos2unix file \nOnce you have removed the Windows carriage returns from the file, you can configure the init script as normal." +} \ No newline at end of file diff --git a/scraped_kb_articles/init-script-stored-in-volume-fails-with-permission-denied-error.json b/scraped_kb_articles/init-script-stored-in-volume-fails-with-permission-denied-error.json new file mode 100644 index 0000000000000000000000000000000000000000..18db27302f2135d25b979704bc3b7d8dd7f57946 --- /dev/null +++ b/scraped_kb_articles/init-script-stored-in-volume-fails-with-permission-denied-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/init-script-stored-in-volume-fails-with-permission-denied-error", + "title": "Unknown Article Title", + "content": "Problem\nWhile starting a cluster, your init script stored on a volume fails to execute. You see a\nPermission denied\nerror even though you have sufficient permissions on the volume.\nCluster scoped init script /Volumes//script.sh failed: Script exit status is non-zero\nThe following image shows an example of the error in a workspace UI.\nYou then check the init script logs stored in the Databricks File System (DBFS) using a notebook.\ndbutils.fs.head(\"dbfs:/.bash.stderr.log\",10000)\nYou see the following error message.\n‘bash:line 11: /Volume/\r\nSh: Permission denied\\n’\nCause\nDuring cluster startup, the init script path is resolved using the cluster owner's privileges. 
If the owner lacks access to the volume or script, the initialization fails even if the user starting the cluster does have the necessary permissions.\nSolution\nEnsure that the cluster owner has access to the volume and the init script.\nTo identify the creator or owner of a cluster:\nNavigate to the cluster configuration UI.\nClick the kebab menu and select\nJSON\n.\nIn the JSON view, look for the field\ncreator_user_name\nto find the creator’s username.\nTo grant the necessary permissions on the volume:\nNavigate to the Catalog Explorer.\nClick on the catalog that contains the volume.\nGo to the\nPermissions\ntab.\nClick the\nGrant\nbutton.\nAssign the required permissions on the volume to the creator.\nThe following image shows the UI after navigating to the\nPermissions\ntab and highlights the\nGrant\nbutton location on the screen." +} \ No newline at end of file diff --git a/scraped_kb_articles/init-script-stored-on-a-volume-fails-to-execute-on-cluster-start.json b/scraped_kb_articles/init-script-stored-on-a-volume-fails-to-execute-on-cluster-start.json new file mode 100644 index 0000000000000000000000000000000000000000..3289e00c5ac5fd7be70eeaea623288f490d24fd7 --- /dev/null +++ b/scraped_kb_articles/init-script-stored-on-a-volume-fails-to-execute-on-cluster-start.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/init-script-stored-on-a-volume-fails-to-execute-on-cluster-start", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to use an init script stored on a Unity Catalog volume path, but your cluster fails to start. 
The error message in the cluster event logs is generic.\nScript exit status is non-zero\nThe specific error message can be found in the\ninit script logs\n(\nAWS\n|\nAzure\n|\nGCP\n).\nbash: /Volumes/abc-init.sh: /bin/bash^M: bad interpreter: No such file or directory\nIf the init script is uploaded as a workspace file instead of stored on a volume, the cluster starts normally.\nCause\nInit script execution fails when the new line code used in the init script is a carriage return + line feed (\nCRLF\n) instead of a line feed. Linux uses the line feed as the new line code, while Windows uses a carriage return + line feed as the new line code. When a text file is created on a Windows system, it defaults to using a carriage return + line feed.\nWhen text files are uploaded as workspace files, the carriage return + line feed is converted to line feed. When text files are uploaded to Unity Catalog volumes, this conversion does not happen. As a result, the init script cannot be processed correctly by the cluster when it starts up.\nSolution\nAny init script uploaded to a volume must use a line feed as a new line.\nIf you created your init scripts on a Windows system you can:\nConvert all carriage return + line feed new lines to line feed new lines in your file before uploading.\nUpload your init script as a workspace file. This automatically converts the init script. You can then copy the init script from workspace files to a volume and it will work as expected.\nYou can create your init script directly in a Databricks notebook.\nExample\nThis sample code creates an init script called test-init.sh in /Volumes/test. It echoes the word\nTEST\nwhen run.\n%python\r\n\r\ndbutils.fs.put(\"/Volumes/test/test-init.sh\", \r\n\"\"\"#!/bin/bash\r\necho \"TEST\"\r\n\"\"\", overwrite=True)\nInfo\nVerify that the first line of the init script is\n#!/bin/bash\n. This ensures the init script is executed with the correct shell interpreter." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/init-script-to-set-up-dask-library-fails-and-cluster-won%E2%80%99t-start.json b/scraped_kb_articles/init-script-to-set-up-dask-library-fails-and-cluster-won%E2%80%99t-start.json new file mode 100644 index 0000000000000000000000000000000000000000..97bf3ad17afd9ec00156e70fa232d40acecad1dd --- /dev/null +++ b/scraped_kb_articles/init-script-to-set-up-dask-library-fails-and-cluster-won%E2%80%99t-start.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/init-script-to-set-up-dask-library-fails-and-cluster-won%E2%80%99t-start", + "title": "Unknown Article Title", + "content": "Problem\nYou’re attempting to set up the Dask library using an init script from the Nvidia tutorial\nRAPIDS on Databricks: A Guide to GPU-Accelerated Data Processing\nand the cluster fails to start, displaying the error\nScript exit status is non-zero\n.\nExample init script\n```bash\r\n#!/bin/bash\r\nset -e\r\n\r\n# Install RAPIDS (cudf & dask-cudf) and dask-databricks\r\n/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \\\r\n      cudf-cu11 \\\r\n      dask[complete] \\\r\n      dask-cudf-cu11  \\\r\n      dask-cuda==24.04 \\\r\n      dask-databricks\r\n\r\n# Start Dask cluster with CUDA workers\r\ndask databricks run --cuda\r\n```\nWhen you run the command\ndask databricks run --cuda\nin a notebook, you receive a separate error.\n[22:25:01] INFO Setting up Dask on a Databricks cluster. 
]8;id=856276;file:///databricks/python/lib/python3.10/site-packages/dask_databricks/cli.py\\cli.py]8;;\\:]8;id=XXXX;file:///databricks/python/lib/python3.10/site-packages/dask_databricks/cli.py#37\\37]8;;\\\r\n           ERROR  Unable to find expected environment variables  ]8;id=942;file:///databricks/python/lib/python3.10/site-packages/dask_databricks/cli.py\\cli.py]8;;\\:]8;id=XXXX;file:///databricks/python/lib/python3.10/site-packages/dask_databricks/cli.py#43\\43]8;;\\\r\n                    DB_IS_DRIVER and DB_DRIVER_IP. Are you running              \r\n                    this command on a Databricks multi-node cluster?\nCause\nThe init script failure causes the cluster not to start. The init script fails because the\ndask databricks run --cuda\ncommand is executed before the necessary environment variables,\nDB_IS_DRIVER\nand\nDB_DRIVER_IP\n, are set.\nSolution\nModify your init script to include a validation check for the required environment variables before executing the\ndask databricks run --cuda\ncommand.\nExample init script with validation checks\nThis script installs the required packages and then checks if\nDB_IS_DRIVER\nand\nDB_DRIVER_IP\nare set. If they are, it starts the Dask cluster with CUDA workers; otherwise, it skips the startup.\n```bash\r\n#!/bin/bash\r\nset -e\r\n\r\n# Install RAPIDS (cudf & dask-cudf) and dask-databricks\r\n/databricks/python/bin/pip install --extra-index-url=https://pypi.nvidia.com \\\r\n      cudf-cu11 \\\r\n      dask[complete] \\\r\n      dask-cudf-cu11  \\\r\n      dask-cuda==24.04 \\\r\n      dask-databricks\r\n\r\n# Check if the necessary environment variables are set\r\nif [[ -n \"$DB_IS_DRIVER\" && -n \"$DB_DRIVER_IP\" ]]; then\r\n  echo \"Environment variables are set. Starting Dask cluster with CUDA workers.\"\r\n  dask databricks run --cuda\r\nelse\r\n  echo \"Required environment variables DB_IS_DRIVER and DB_DRIVER_IP are not set. 
Skipping Dask cluster startup.\"\r\nfi\r\n```\nAdditionally, ensure that you are using Databricks Runtime 14.2 ML (which includes Apache Spark 3.5.0, GPU, and Scala 2.12) to avoid Python dependency problems. Use a single g4dn.xlarge node with GPU attached so the initialization script finishes successfully." +} \ No newline at end of file diff --git a/scraped_kb_articles/init-scripts-failing-with-unexpected-end-of-file-error.json b/scraped_kb_articles/init-scripts-failing-with-unexpected-end-of-file-error.json new file mode 100644 index 0000000000000000000000000000000000000000..c4ec4b29c5af17b7509003ab32d113d67ca9d9d6 --- /dev/null +++ b/scraped_kb_articles/init-scripts-failing-with-unexpected-end-of-file-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/init-scripts-failing-with-unexpected-end-of-file-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen you create init scripts using Windows Notepad or similar Integrated Development Environments (IDEs) and directly upload these files, you encounter the following error.\n//: line n: syntax error: unexpected end of file\nCause\nWindows Notepad and similar editors insert special characters like carriage returns (\n\\r\n), which are not visible within the script but are incompatible with Linux-based environments. Since Databricks clusters operate on a Linux OS, these characters cause the script to fail.\nSolution\nFollow the steps in the Databricks KB article\nInit script stored on a volume fails to execute on cluster start\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/inner-join-drops-records-in-result.json b/scraped_kb_articles/inner-join-drops-records-in-result.json new file mode 100644 index 0000000000000000000000000000000000000000..8412012d44f71e024f5fa52c611214e2225f2831 --- /dev/null +++ b/scraped_kb_articles/inner-join-drops-records-in-result.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/inner-join-drops-records-in-result", + "title": "Unknown Article Title", + "content": "Problem\nYou perform an inner join, but the resulting joined table is missing data.\nFor example, assume you have two tables,\norders\nand\nmodels\n.\n%python\r\n\r\ndf_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], [\"Company\", \"Model\", \"Info\"])\n%python\r\n\r\ndf_models = spark.createDataFrame([('Nissan','Altima',''), ('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','2-door 3.5 SE Coupe'), ('Nissan','Altima','4-door 2.5 S Sedan'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima','4-door 3.5 SL Sedan'), ('Nissan','Altima','4-door HYBRID Sedan'), ('Nissan','Altima',None)], [\"Company\", \"Model\", \"Info\"])\nYou attempt a straight join of the two tables.\n%python\r\n\r\ndf_orders.createOrReplaceTempView(\"Orders\")\r\ndf_models.createOrReplaceTempView(\"Models\")\n%sql\r\n\r\nSELECT *\r\nFROM Orders a\r\nINNER JOIN Models b\r\nON a.Company = b.Company\r\nAND a.Model = b.Model\r\nAND a.Info = b.Info\nThe resulting joined table only includes three of the four records from the\norders\ntable. 
The record with a\nnull\nvalue in a column does not appear in the results.\nCause\nApache Spark does not consider\nnull\nvalues when performing a join operation.\nIf you attempt to join tables, and some of the columns contain\nnull\nvalues, the\nnull\nrecords will not be included in the resulting joined table.\nSolution\nIf your source tables contain\nnull\nvalues, you should use the Spark\nnull\nsafe operator (\n<=>\n).\nWhen you use\n<=>\n, Spark processes\nnull\nvalues (instead of dropping them) when performing a join.\nFor example, if we modify the sample code with\n<=>\n, the resulting table does not drop the\nnull\nvalues.\n%sql\r\n\r\nSELECT *\r\nFROM Orders a\r\nINNER JOIN Models b\r\nON a.Company = b.Company\r\nAND a.Model = b.Model\r\nAND a.Info <=> b.Info\nExample notebook\nReview the\nInner join drops null values example notebook\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/insert-operation-fails-while-trying-to-execute-multiple-concurrent-insert-or-merge-operations-to-append-data.json b/scraped_kb_articles/insert-operation-fails-while-trying-to-execute-multiple-concurrent-insert-or-merge-operations-to-append-data.json new file mode 100644 index 0000000000000000000000000000000000000000..946b539d223de633e64f1bd5802b3a1663730c54 --- /dev/null +++ b/scraped_kb_articles/insert-operation-fails-while-trying-to-execute-multiple-concurrent-insert-or-merge-operations-to-append-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/insert-operation-fails-while-trying-to-execute-multiple-concurrent-insert-or-merge-operations-to-append-data", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to execute multiple concurrent\nINSERT\nor\nMERGE\noperations to append data to a table, the\nINSERT\noperation fails.\nConcurrentAppendException Files were added to the root of the table by a concurrent update. 
Please try the operation again.\r\nConflicting commit:\r\n{\"timestamp\":xxxxx,\"userId\":\"xxxxx\",\"userName\":\"xxxxx\",\"operation\":\"WRITE\",\"operationParameters\":{\"mode\":Append,\"statsOnLoad\":false,\"partitionBy\":[]},\"job\":{\"jobId\":\"xxxxx\",\"jobName\":\"xxxxx\",\"jobRunId\":\"xxxxx\",\"runId\":\"xxxxx\",\"jobOwnerId\":\"xxxxx\",\"triggerType\":\"manual\"},\"notebook\":{\"notebookId\":\"xxxxx\"},\"clusterId\":\"xxxxx\",\"readVersion\":xxxxx,\"isolationLevel\":\"WriteSerializable\",\"isBlindAppend\":false,\"operationMetrics\":{\"numFiles\":\"xx\",\"numOutputRows\":\"xx\",\"numOutputBytes\":\"xxxxx\"},\"tags\":{\"restoresDeletedRows\":\"false\"},\"engineInfo\":\"Databricks-Runtime/13.3.x-aarch64-scala2.12\",\"txnId\":\"xxxxx\"}\nCause\nWhen multiple operations try to access the same data at once, they interfere with each other.\nSolution\nTrace back the conflicting operation by following the\nConflicting commit\ndetails in the error message.\nEnsure that the isolation level is set appropriately for your use case.\nWhen you expect concurrent inserts, modify relevant queries to set\nblindAppend\nto\ntrue\n.\nIf you use other operations such as\nUPDATE\n,\nDELETE\n,\nMERGE INTO\n, or\nOPTIMIZE\n, consult the documentation for expected write conflicts.\nFor more information, review the\nIsolation levels and write conflicts on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nNote\nIf you don’t expect concurrency to occur often, set up retries to check for conflicts during concurrent appends."
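The retry advice in the note above can be sketched as a simple backoff loop. This is a minimal illustration, not Databricks' implementation: the exception class here is a stand-in for Delta's concurrent-append conflict error, and `write_fn` stands for whatever append operation you perform.

```python
import time

class ConcurrentAppendException(Exception):
    """Stand-in for Delta's concurrent-append conflict error."""

def append_with_retry(write_fn, max_attempts=5, base_delay=0.01):
    """Call write_fn(), retrying with exponential backoff on append conflicts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except ConcurrentAppendException:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Exponential backoff spreads the retries out so competing writers are less likely to collide again on the next commit.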
+} \ No newline at end of file diff --git a/scraped_kb_articles/insert-overwrite-directory-with-hive-format-failing-with-specified-path-already-exists-error.json b/scraped_kb_articles/insert-overwrite-directory-with-hive-format-failing-with-specified-path-already-exists-error.json new file mode 100644 index 0000000000000000000000000000000000000000..d623e75f462c35a8d0ccf1422130cb53958ca18e --- /dev/null +++ b/scraped_kb_articles/insert-overwrite-directory-with-hive-format-failing-with-specified-path-already-exists-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/insert-overwrite-directory-with-hive-format-failing-with-specified-path-already-exists-error", + "title": "Unknown Article Title", + "content": "Problem\nYou’re using Databricks Runtime 12.2 LTS or above on a No Isolation Shared compute to perform an\nINSERT OVERWRITE DIRECTORY\noperation with Hive format. The following code is an example.\nINSERT OVERWRITE DIRECTORY '' SELECT * FROM ..
\nThe operation fails with the following error message.\nSparkException: Failed inserting overwrite directory \r\n...\r\nCaused by: Operation failed: \"The specified path already exists.\", 409,PUT,\r\n:///////?resource=file&timeout=90, PathAlreadyExists, \"The specified path already exists. RequestId:xxxxxxxxxxx Time:yyyy-mm-ddThh:mm:ssZ\"\nCause\nAs of Databricks Runtime 12.2 LTS, Hadoop Distributed File System (HDFS) has modified its handling of directory overwriting operations when using Hive format table queries. When a directory already exists before executing\nINSERT OVERWRITE DIRECTORY\nwith Hive format, the error occurs.\nSolution\nUse a Parquet directory write instead. Add\nUSING PARQUET\nafter the\nINSERT OVERWRITE DIRECTORY\nas shown in the following example.\nINSERT OVERWRITE DIRECTORY '' USING PARQUET SELECT * FROM ..
" +} \ No newline at end of file diff --git a/scraped_kb_articles/install-cartopy-on-cluster.json b/scraped_kb_articles/install-cartopy-on-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..e87c9089ee1eebc4a015ba76ba2075c87c0ef8ea --- /dev/null +++ b/scraped_kb_articles/install-cartopy-on-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/install-cartopy-on-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to install\nCartopy\non a cluster and you receive a\nManagedLibraryInstallFailed\nerror message.\njava.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, cartopy==0.17.0, --disable-pip-version-check) exited with code 1.   ERROR: Command errored out with exit status 1:\r\n   command: /databricks/python3/bin/python3.7 /databricks/python3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjoliwaky\r\n       cwd: /tmp/pip-install-t324easa/cartopy\r\n  Complete output (3 lines):\r\n  setup.py:171: UserWarning: Unable to determine GEOS version. Ensure you have 3.3.3 or later installed, or installation may fail.\r\n    '.'.join(str(v) for v in GEOS_MIN_VERSION), ))\r\n  Proj 4.9.0 must be installed.\r\n  ----------------------------------------\r\nERROR: Command errored out with exit status 1: /databricks/python3/bin/python3.7 /databricks/python3/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjoliwaky Check the logs for full command output.\r\n for library:PythonPyPiPkgId(cartopy,Some(0.17.0),None,List()),isSharedLibrary=false\nCause\nCartopy\nhas dependencies on\nlibgeos\n3.3.3 and above and\nlibproj\n4.9.0. 
If\nlibgeos\nand\nlibproj\nare not installed,\nCartopy\nfails to install.\nSolution\nConfigure a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n) to automatically install\nCartopy\nand the required dependencies.\nCreate the base directory to store the init script in, if the base directory does not exist. Here, use\ndbfs:/databricks/\nas an example.\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the script and save it to a file.\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//cartopy.sh\",\"\"\"\r\n#!/bin/bash\r\nsudo apt-get install libgeos++-dev -y\r\nsudo apt-get install libproj-dev -y\r\n/databricks/python/bin/pip install Cartopy\r\n\"\"\",True)\nCheck that the script exists.\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//cartopy.sh\"))\nOn the cluster configuration page, click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab.\nIn the\nDestination\ndrop-down, select DBFS, provide the file path to the script, and click\nAdd\n.\nRestart the cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/install-package-cran-snapshot.json b/scraped_kb_articles/install-package-cran-snapshot.json new file mode 100644 index 0000000000000000000000000000000000000000..7d3cb35380e4a386e24f763113272c4b6380a38d --- /dev/null +++ b/scraped_kb_articles/install-package-cran-snapshot.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/install-package-cran-snapshot", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to install a library package via CRAN, and are getting a\nLibrary installation failed for library due to infra fault\nerror message.\nLibrary installation failed for library due to infra fault for Some(cran {\r\npackage: \"\"\r\n}\r\n).
Error messages:\r\njava.lang.RuntimeException: Installation failed with message:\r\n\r\nError installing R package: Could not install package with error: installation of package had non-zero exit status\nCause\nCRAN maintains daily snapshots. If a snapshot is invalid for some reason, you get an error when trying to install a package.\nSolution\nSpecify a previous snapshot when installing your library.\nCheck the date of the most recent snapshot at\nhttps://cran.microsoft.com/snapshot/\n.\nPick a date that is at least one day older than the most recent snapshot. For example, if the most recent snapshot is dated July 9, 2021, you should use July 8, 2021.\nEnter the full URL to your chosen snapshot in the\nRepository\nfield when you install the package on your cluster (\nAWS\n|\nAzure\n|\nGCP\n). To use the July 8, 2021 snapshot, enter\nhttps://cran.microsoft.com/snapshot/2021-07-08/\nas the full URL for the repository." +} \ No newline at end of file diff --git a/scraped_kb_articles/install-private-pypi-repo.json b/scraped_kb_articles/install-private-pypi-repo.json new file mode 100644 index 0000000000000000000000000000000000000000..99da47a69b812f639135e8bf615ec1491fa3e5cb --- /dev/null +++ b/scraped_kb_articles/install-private-pypi-repo.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/install-private-pypi-repo", + "title": "Unknown Article Title", + "content": "Certain use cases may require you to install libraries from private PyPI repositories.\nIf you are installing from a public repository, you should review the\nlibrary documentation\n.\nThis article shows you how to configure an example init script that authenticates and downloads a PyPI library from a private repository.\nCreate init script\nCreate (or verify) a directory to store the init script.\n\nis the name of the folder where you store your init scripts.\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the init
script.\ndbutils.fs.put(\"/databricks//private-pypi-install.sh\",\"\"\"\r\n#!/bin/bash\r\n/databricks/python/bin/pip install --index-url=https://${}:${}@ private-package==\r\n\"\"\", True)\nVerify that your init script exists.\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//private-pypi-install.sh\"))\nInstall as a cluster-scoped init script\nInstall the init script that you just created as a\ncluster-scoped init script\n.\nYou will need the full path to the location of the script\n(dbfs:/databricks//private-pypi-install.sh)\n.\nRestart the cluster\nRestart your cluster after you have installed the init script.\nOnce the cluster starts up, verify that it successfully installed the custom library from the private PyPI repository.\nIf the custom library is not installed, double-check the username and password that you set for the private PyPI repository in the init script.\nUse the init script with a job cluster\nOnce you have the init script created and verified working, you can include it in a\ncreate-job.json\nfile when using the\nJobs API\nto start a job cluster.\n{\r\n  \"cluster_id\": \"1202-211320-brick1\",\r\n  \"num_workers\": 1,\r\n  \"spark_version\": \"\",\r\n  \"node_type_id\": \"\",\r\n  \"cluster_log_conf\": {\r\n    \"dbfs\" : {\r\n      \"destination\": \"dbfs:/cluster-logs\"\r\n    }\r\n  },\r\n  \"init_scripts\": [ {\r\n    \"dbfs\": {\r\n      \"destination\": \"dbfs:/databricks//private-pypi-install.sh\"\r\n    }\r\n  } ]\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/install-pygraphviz.json b/scraped_kb_articles/install-pygraphviz.json new file mode 100644 index 0000000000000000000000000000000000000000..1758ee97d0bc71643e83f50ac474dd001ee574e8 --- /dev/null +++ b/scraped_kb_articles/install-pygraphviz.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/install-pygraphviz", + "title": "Unknown Article Title", + "content": "PyGraphViz\nPython libraries are used to plot causal inference
networks.\nIf you try to install\nPyGraphViz\nas a standard library, it fails due to dependency errors.\nPyGraphViz\nhas the following dependencies:\npython3-dev\ngraphviz\nlibgraphviz-dev\npkg-config\nInstall via notebook\nInstall the dependencies with\napt-get\n.\n%sh\r\n\r\nsudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config\nAfter the dependencies are installed, use\npip\nto install\nPyGraphViz\n.\n%sh\r\n\r\npip install pygraphviz\nInstall via init script\nCreate the init script.\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//install-pygraphviz.sh\",\r\n\"\"\"\r\n#!/bin/bash\r\n#install dependent packages\r\nsudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config\r\npip install pygraphviz\"\"\", True)\nInstall the init script that you just created as a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nYou will need the full path to the location of the script (\ndbfs:/databricks//install-pygraphviz.sh\n).\nRestart the cluster after you have installed the init script." +} \ No newline at end of file diff --git a/scraped_kb_articles/install-pyodbc-on-cluster.json b/scraped_kb_articles/install-pyodbc-on-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..cfdf5cea1c137010c41747351098ca16544942cb --- /dev/null +++ b/scraped_kb_articles/install-pyodbc-on-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/install-pyodbc-on-cluster", + "title": "Unknown Article Title", + "content": "Problem\nOne of the following errors occurs when you use\npip\nto install the\npyodbc\nlibrary.\njava.lang.RuntimeException: Installation failed with message: Collecting pyodbc\n\"Library installation is failing due to missing dependencies.
sasl and thrift_sasl are optional dependencies for SASL or Kerberos support\"\nCause\nAlthough\nsasl\nand\nthrift_sasl\nare optional dependencies for SASL or Kerberos support, they need to be present for\npyodbc\ninstallation to succeed.\nSolution\nCluster-scoped init script method\nYou can put these commands into a single init script and attach it to the cluster. This ensures that the dependent libraries for pyodbc are installed before the cluster starts.\nCreate the base directory to store the init script in, if the base directory does not exist. Here, use\ndbfs:/databricks/\nas an example.\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the script and save it to a file.\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//tornado.sh\",\"\"\"\r\n#!/bin/bash\r\npip list | egrep 'thrift-sasl|sasl'\r\npip install --upgrade thrift\r\ndpkg -l | egrep 'thrift_sasl|libsasl2-dev|gcc|python-dev'\r\nsudo apt-get -y install unixodbc-dev libsasl2-dev gcc python-dev\r\n\"\"\",True)\nCheck that the script exists.\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//tornado.sh\"))\nOn the cluster configuration page, click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab.\nIn the\nDestination\ndrop-down, select DBFS, provide the file path to the script, and click\nAdd\n.\nRestart the cluster.\nFor more details about cluster-scoped init scripts, see Cluster-scoped init scripts (\nAWS\n|\nAzure\n|\nGCP\n).\nNotebook method\nIn a notebook, check the version of\nthrift\nand upgrade to the latest version.\n%sh\r\n\r\npip list | egrep 'thrift-sasl|sasl'\r\npip install --upgrade thrift\nEnsure that dependent packages are installed.\n%sh\r\n\r\ndpkg -l | egrep 'thrift_sasl|libsasl2-dev|gcc|python-dev'\nInstall\nunixodbc\nbefore installing\npyodbc\n.\n%sh\r\n\r\nsudo apt-get -y install unixodbc-dev libsasl2-dev gcc python-dev" +} \ No newline at end of file diff --git a/scraped_kb_articles/install-rjava-rjdbc-libraries.json
b/scraped_kb_articles/install-rjava-rjdbc-libraries.json new file mode 100644 index 0000000000000000000000000000000000000000..f66581e7f0c55b910e0b4213d1dd7ae907ed9582 --- /dev/null +++ b/scraped_kb_articles/install-rjava-rjdbc-libraries.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/install-rjava-rjdbc-libraries", + "title": "Unknown Article Title", + "content": "This article explains how to install the rJava and RJDBC libraries.\nProblem\nWhen you install rJava and RJDBC libraries with the following command in a notebook cell:\n%r\r\n\r\ninstall.packages(c(\"rJava\", \"RJDBC\"))\nYou observe the following error:\nERROR: configuration failed for package 'rJava'\nCause\nThe rJava and RJDBC packages check for Java dependencies and file paths that are not present in the Databricks R directory.\nSolution\nFollow the steps below to install these libraries on running clusters.\nRun the following commands in a\n%sh\ncell.\n%sh\r\n\r\nls -l /usr/bin/java\r\nls -l /etc/alternatives/java\r\nln -s /usr/lib/jvm/java-8-openjdk-amd64 /usr/lib/jvm/default-java\r\nR CMD javareconf\nInstall the rJava and RJDBC packages.\n%r\r\n\r\ninstall.packages(c(\"rJava\", \"RJDBC\"))\nVerify that the rJava package is installed.\n%r\r\n\r\ndyn.load('/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so')\r\nlibrary(rJava)" +} \ No newline at end of file diff --git a/scraped_kb_articles/install-turbodbc.json b/scraped_kb_articles/install-turbodbc.json new file mode 100644 index 0000000000000000000000000000000000000000..ef64ebe8fec7a556434b48e365b9c4d672eebb9f --- /dev/null +++ b/scraped_kb_articles/install-turbodbc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/install-turbodbc", + "title": "Unknown Article Title", + "content": "Turbodbc\nis a Python module that uses the ODBC interface to access relational databases.\nIt has dependencies on\nlibboost-all-dev\n,\nunixodbc-dev\n, and\npython-dev\npackages, which need to be
installed in order.\nYou can install these manually, or you can use an init script to automate the install.\nCreate the init script\nRun this sample script in a notebook to create the init script on your cluster.\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/\")\r\ndbutils.fs.put(\"dbfs://turbodbc_install.sh\", \"\"\"\r\n#!/bin/bash\r\n#install dependent packages\r\nsudo apt-get -y install libboost-all-dev unixodbc-dev python-dev\r\npip install turbodbc==4.1.1\r\n\"\"\",True)\nRemember the path to the init script. You will need it when configuring your cluster.\nConfigure the init script\nFollow the documentation to configure a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nSpecify the path to the init script. Use the same path that you used in the sample script.\nAfter configuring the init script, restart the cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/installing-lme4-fails-with-a-matrix-version-error.json b/scraped_kb_articles/installing-lme4-fails-with-a-matrix-version-error.json new file mode 100644 index 0000000000000000000000000000000000000000..b11f334d5f1b13125b7378f8a6a44134c0b8af1b --- /dev/null +++ b/scraped_kb_articles/installing-lme4-fails-with-a-matrix-version-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/installing-lme4-fails-with-a-matrix-version-error", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to install the\nlme4\nR library on a cluster running Databricks Runtime 15.4 LTS ML or below when you see error messages indicating that the required version of the\nMatrix\npackage is not available.\nError: package 'Matrix' 1.5-1 was found, but >= 1.6.2 is required by 'lme4'.\nThis prevents the successful installation and usage of the\nlme4\npackage, which is essential for certain statistical modeling tasks.\nCause\nThere is a dependency issue between the version of R included in the Databricks Runtime you are using and the requirements for
the\nlme4\npackage. Specifically,\nlme4\nrequires\nMatrix\nversion 1.6.2 or above.\nSolution\nYou can upgrade to Databricks Runtime 16.0 and above, which includes an updated version of\nMatrix\n.\nIf you do not want to use a newer Databricks Runtime, you can pin the required library versions in the cluster configuration or you can create an init script that installs the libraries when your cluster starts.\nPin library versions\nYou must upgrade the version of\nMatrix\nto 1.6.2 or above before you can successfully install\nlme4\n. You can install both libraries as\ncluster libraries\n(\nAWS\n|\nAzure\n|\nGCP\n) from the workspace UI.\nSelect\nCRAN\nas the\nLibrary Source\nand enter the specific versions:\nMatrix==1.6-2\nlme4==1.1-28\nInstall via init script\nCreate an init script with the following content:\n%sh\r\n\r\n#!/bin/bash\r\n\r\n# Update the package list\r\nsudo apt-get update\r\n\r\n# Install necessary dependencies for R packages\r\nsudo apt-get install -y libcurl4-openssl-dev libxml2-dev libssl-dev\r\n\r\n# Install the Matrix package version 1.6-2 from the CRAN archive\r\nsudo R -e \"install.packages('remotes')\"\r\nsudo R -e \"remotes::install_version('Matrix', version = '1.6-2', repos = 'http://cran.us.r-project.org')\"\r\n\r\n# Install the lme4 package\r\nsudo R -e \"install.packages('remotes', repos = 'http://cran.us.r-project.org'); remotes::install_version('lme4', version = '1.1-28', repos = 'http://cran.us.r-project.org')\"\r\n\r\n# Test loading the lme4 package\r\n# sudo R -e \"library('lme4')\"\nSave the init script as a workspace file, to a Unity Catalog volume, or cloud storage.\nConfigure the init script as a\ncluster-scoped init script\n(\nAWS\n|\nAzure\n|\nGCP\n) or a\nglobal init script\n(\nAWS\n|\nAzure\n|\nGCP\n) depending on your use case.\nRestart your cluster to apply the changes. The init script runs during cluster startup, installing the required versions of R and its dependencies, including the\nlme4\npackage." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/insufficient-permission-error-when-querying-views-in-dedicated-access-mode.json b/scraped_kb_articles/insufficient-permission-error-when-querying-views-in-dedicated-access-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..a27bc6efa386c874821a0975f72255c18dd8e221 --- /dev/null +++ b/scraped_kb_articles/insufficient-permission-error-when-querying-views-in-dedicated-access-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/insufficient-permission-error-when-querying-views-in-dedicated-access-mode", + "title": "Unknown Article Title", + "content": "Problem\nYou are querying a view on a cluster running in dedicated (formerly single user) access mode when you get an insufficient permission error.\nAnalysisException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:    '.'.\nCause\nWhen running Databricks Runtime 15.3 and below in dedicated access mode, any user who runs a query on a view must have the\nSELECT\npermission on the tables and views referenced by the view. Data filtering functionality that enables fine-grained access control is available in Databricks Runtime 15.4 and above.\nSolution\nYou can resolve this issue in multiple ways.\nGet the necessary privileges. Ask an admin to ensure you have:\nThe\nUSE CATALOG\nprivilege on the catalog.\nThe\nUSE SCHEMA\nprivilege on the schema.\nThe\nSELECT\nprivilege on any referenced tables or views.\nIf you must use Databricks Runtime 15.3 or below, use an alternative compute resource. You can use a SQL warehouse or compute running in standard (formerly shared) access mode.\nUpgrade your Databricks Runtime version (on AWS and Azure). Databricks Runtime 15.4 or above allows the use of dedicated access mode along with data filtering.
The user who queries the view does not need access to the referenced tables and views.\nThe workspace must have serverless compute enabled for jobs, notebooks, and Delta Live Tables.\nFor more information, review the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/insufficient-privileges-error-when-querying-views-in-a-unity-catalog-metastore.json b/scraped_kb_articles/insufficient-privileges-error-when-querying-views-in-a-unity-catalog-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..44aaf1c9c24a675bec87b7bf9f0224e16a13070f --- /dev/null +++ b/scraped_kb_articles/insufficient-privileges-error-when-querying-views-in-a-unity-catalog-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/insufficient-privileges-error-when-querying-views-in-a-unity-catalog-metastore", + "title": "Unknown Article Title", + "content": "Problem\nWhen querying views in complex, multi-layer view setups in a Unity Catalog metastore, you encounter an insufficient table permissions error.\nYour request failed with status FAILED: [BAD_REQUEST] [INSUFFICIENT_PERMISSIONS] Insufficient privileges: Table 'test_table1' does not have sufficient privilege to execute because the owner of one of the underlying resources failed an authorization check. SQLSTATE: 42501\nNote\nThis article assumes that you have all requirements for querying views.
For details, review the\nWhat is a view?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCause\nYou’re working in an environment with multiple ownership layers between views and underlying tables that hold the actual data.\nThere are missing permissions in the permission hierarchy for dependent views or tables, where the condition of “view table owner must have select access to its upstream tables/views” is not met.\nContext\nIn scenarios where there is only one layer, meaning the view table was created directly from a table(s) that holds data, only the view owner’s permission against the source table(s) will need to be checked.\nIn more complex scenarios where there are multiple layers, meaning the view table is created from view table(s), those view tables might also be created from more view tables, and there are multiple layers of reference before reaching the source table.\nSolution\nTrace back the permission tree for the view table. To do so, you need Unity Catalog metastore admin privileges. If you do not have admin privileges, contact your metastore admin to assist.\nFind the owner of the view table giving you the error message. The owner’s permission on the upstream table is required for the view table to work.\nUse the lineage graph of the view table to identify its upstream table(s).\nCheck if the owner of the view table has at least SELECT access to each of the upstream table(s). Note: the owner of the view table might not be found in the list of users with granted permission if the upstream table is owned by the same user.\nRepeat the steps above for each view table if the upstream table is a view table, until you reach the data table.\nExample illustrating the traceback procedure\nThe following diagram illustrates the solution steps, and is accompanied by an example description set of steps to make the traceback instruction steps more concrete.\nStart with the problematic view table (View Table4)\nThe issue happens when trying to query View Table4. 
The table owner is userA.\nThe upstreams of View Table4 are View Table2 and View Table3.\nExamine View Table2\nCheck if userA has SELECT access on View Table2.\nUserA is also the owner of View Table2, so the permission is good.\nThen check the upstream of View Table2, which is Data Table1.\nUserA is also the owner of Data Table1, so the permission is good.\nYou have reached the base table where the actual data is. The route is good.\nExamine View Table3\nNext, check if userA has SELECT access on the other view table initially identified, View Table3.\nIf userA does not have SELECT permission on View Table3, grant permission here.\nIf userA does have SELECT permission on View Table3, check if the owner of View Table3 (userD) has SELECT permission to the upstream tables, which are View Table1 and Data Table4.\nExamine View Table1\nIf userD does not have SELECT permission on View Table1, grant permission here.\nCheck if the owner of View Table1 (userB) has SELECT permission on the upstream tables, which are Data Table2 and Data Table3.\nUserB is also the owner of Data Table 2. The permission is good.\nUserB does not have SELECT permission to Data Table3. Grant permission here.\nExamine Data Table4\nUserD is also the owner of Data Table4. The permission is good.\nYou have reached the base table where the actual data is. This route is good.\nOnce you or your metastore admin have checked all the routes from the problematic view table to the data table, and ensured that each view table has the necessary\nSELECT\naccess on its upstream table(s), the\n“INSUFFICIENT_PERMISSIONS”\nerror should resolve." 
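The traceback procedure above can be expressed as a recursive walk over the view's lineage graph. The data structures here (dicts for owners and upstreams, a set of SELECT grants) are assumptions for illustration, not a Unity Catalog API:

```python
def find_missing_grants(view, owners, upstreams, select_grants):
    """Return (owner, upstream_table) pairs where a required SELECT grant is missing.

    owners:        table/view name -> owner
    upstreams:     view name -> list of its direct upstream tables/views
    select_grants: set of (user, table) pairs with SELECT granted
    A table's owner implicitly has SELECT on the tables it owns.
    """
    missing = []
    owner = owners[view]
    for up in upstreams.get(view, []):
        if owners[up] != owner and (owner, up) not in select_grants:
            missing.append((owner, up))
        if up in upstreams:  # the upstream is itself a view: keep tracing
            missing.extend(find_missing_grants(up, owners, upstreams, select_grants))
    return missing
```

Run against the example above (userA owns View Table4 and View Table2, userB lacks SELECT on Data Table3), the walk surfaces exactly the grant the article says to add.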
+} \ No newline at end of file diff --git a/scraped_kb_articles/intermittent-long-running-optimize-command-for-liquid-clustered-table.json b/scraped_kb_articles/intermittent-long-running-optimize-command-for-liquid-clustered-table.json new file mode 100644 index 0000000000000000000000000000000000000000..6f1156320cf057be4743ba80bae3a358f932a485 --- /dev/null +++ b/scraped_kb_articles/intermittent-long-running-optimize-command-for-liquid-clustered-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/intermittent-long-running-optimize-command-for-liquid-clustered-table", + "title": "Unknown Article Title", + "content": "Problem\nWhen you run the\nOPTIMIZE\ncommand on a table with liquid clustering, you notice it sometimes takes several hours instead of a minute or so.\nYou check the table history for the operation metrics using the\nDESCRIBE HISTORY\ncommand and notice that the long-running\nOPTIMIZE\nqueries compact more files than you usually see.\nCause\nLiquid clustering occasionally needs to rebalance its internal metadata to ensure that it closely matches the table state and can be easily updated for new insertions to the table.\nIn Databricks Runtime versions 16.1 and below, rebalancing requires collecting samples from the entire table, and liquid clustering may have to rewrite files after a metadata update if new metadata differs from the previous state.\nThis issue doesn’t occur right away on a table where rebalancing has already occurred. If you notice in the Apache Spark UI that the\nOPTIMIZE\nquery scans all the files in a table, the table is likely undergoing metadata rebalancing.\nSolution\nUse Databricks Runtime 16.2 or above to run the\nOPTIMIZE\ncommand on your table with liquid clustering."
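One way to spot the rebalancing runs described above is to compare per-run file counts from the table history against the typical value. The row shape below is a simplified assumption (plain dicts rather than the rows DESCRIBE HISTORY actually returns):

```python
def flag_rebalancing_runs(history_rows, factor=10):
    """Return OPTIMIZE file counts that dwarf the median count for the table.

    history_rows loosely mirrors DESCRIBE HISTORY output: dicts with an
    'operation' name and an 'operationMetrics' dict of string-valued metrics.
    """
    counts = [
        int(row["operationMetrics"]["numRemovedFiles"])
        for row in history_rows
        if row["operation"] == "OPTIMIZE"
    ]
    if not counts:
        return []
    median = sorted(counts)[len(counts) // 2]
    return [c for c in counts if median and c > factor * median]
```

A run that compacts ten times the median number of files is a reasonable candidate for the metadata rebalancing behavior the article describes.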
+} \ No newline at end of file diff --git a/scraped_kb_articles/intermittent-nullpointerexception-aqe-enabled.json b/scraped_kb_articles/intermittent-nullpointerexception-aqe-enabled.json new file mode 100644 index 0000000000000000000000000000000000000000..eb66d5c587252ef3b57db2c71343168d0c67d194 --- /dev/null +++ b/scraped_kb_articles/intermittent-nullpointerexception-aqe-enabled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/intermittent-nullpointerexception-aqe-enabled", + "title": "Unknown Article Title", + "content": "Problem\nYou get an intermittent\nNullPointerException\nerror when saving your data.\nPy4JJavaError: An error occurred while calling o2892.save.\r\n: java.lang.NullPointerException\r\n    at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1(OptimizeSkewedJoin.scala:167)\r\n    at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1$adapted(OptimizeSkewedJoin.scala:167)\r\n    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)\r\n    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)\r\n    ....\r\n    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\r\n    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\r\n    at java.lang.reflect.Method.invoke(Method.java:498)\r\n    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\r\n    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)\r\n    at py4j.Gateway.invoke(Gateway.java:295)\r\n    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\r\n    at py4j.commands.CallCommand.execute(CallCommand.java:79)\r\n    at py4j.GatewayConnection.run(GatewayConnection.java:251)\r\n    at java.lang.Thread.run(Thread.java:748)\nCause\nThis error can occur if adaptive query execution (AQE) (\nAWS\n|\nAzure\n) is enabled and you
are joining data. If AQE is enabled, skew join is also enabled.\nIf any of the shuffle data fails due to a cluster scaling-down event, it generates a\nNullPointerException\nerror.\nSolution\nSet\nspark.sql.adaptive.skewJoin.enabled\nto\nfalse\nin your\nSpark config\n(\nAWS\n|\nAzure\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/invalid-access-token-airflow.json b/scraped_kb_articles/invalid-access-token-airflow.json new file mode 100644 index 0000000000000000000000000000000000000000..a65037983b679d88c732ac6196c5ab35c19db54e --- /dev/null +++ b/scraped_kb_articles/invalid-access-token-airflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/invalid-access-token-airflow", + "title": "Unknown Article Title", + "content": "Problem\nWhen you run scheduled Airflow Databricks jobs, you get this error:\nInvalid Access Token : 403 Forbidden Error\nCause\nTo run or schedule Databricks jobs through Airflow, you need to configure the Databricks connection using the Airflow web UI. Incorrect configuration of any of the following required settings can cause the error:\nSet the\nhost\nfield to the Databricks workspace hostname.\nSet the\nlogin\nfield to token.\nSet the\npassword\nfield to the Databricks-generated personal access token.\nSet the\nExtra\nfield to a JSON string, where the key is\ntoken\nand the value is your personal access token.\nThe Databricks-generated personal access token is normally valid for 90 days. If the token expires, then this\n403 Forbidden Error\noccurs.\nSolution\nVerify that the\nExtra\nfield is correctly configured with the JSON string:\n{\"token\": \"\"}\nVerify that the token is mentioned in both the\npassword\nfield and the\nExtra\nfield.\nVerify that the\nhost\n,\nlogin\n, and\npassword\nfields are configured correctly.\nVerify that the personal access token has not expired.\nIf necessary, generate a new token (\nAWS\n|\nAzure\n)."
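The verification steps in the solution above can be scripted. This is only a sketch of the manual checks, not an Airflow API: the four values are assumed to come from your connection form, and the field names are the ones described in the article.

```python
import json

def validate_databricks_conn(host, login, password, extra):
    """Return a list of problems with the connection settings described above."""
    problems = []
    if not host:
        problems.append("host is empty")
    if login != "token":
        problems.append("login must be the literal string 'token'")
    token = None
    try:
        token = json.loads(extra).get("token")
    except (TypeError, ValueError):
        problems.append("Extra is not valid JSON")
    if token is None:
        problems.append("Extra is missing the 'token' key")
    elif token != password:
        problems.append("token in Extra does not match password")
    return problems
```

An empty list means the connection passes every check the article asks you to verify by hand (token expiry excepted, which only the workspace can confirm).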
+} \ No newline at end of file diff --git a/scraped_kb_articles/invalid-cron-syntax-error-when-scheduling-multiple-values-in-a-jobs-day-of-week-field.json b/scraped_kb_articles/invalid-cron-syntax-error-when-scheduling-multiple-values-in-a-jobs-day-of-week-field.json new file mode 100644 index 0000000000000000000000000000000000000000..266dc3244a16e298894719acd8ca02d02580fbc0 --- /dev/null +++ b/scraped_kb_articles/invalid-cron-syntax-error-when-scheduling-multiple-values-in-a-jobs-day-of-week-field.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/invalid-cron-syntax-error-when-scheduling-multiple-values-in-a-jobs-day-of-week-field", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to schedule a job using the cron expression\n(0 0 14 ? * x#y,x#y)\nwhere\nx\nis the day of the week and\ny\nis the occurrence of that day in the month (first, second, and so on). As an example, you might write\n(0 0 14 ? * 3#1,3#3)\nto schedule the job for the first and third Tuesday of the month.\nWhen you try to run your job using this expression, you receive an error.\n\"Invalid Quartz Cron Expression.\"\nCause\nQuartz allows only one\nx#y\nexpression in the day-of-week field. A cron expression with two\n#\nexpressions in the same field is invalid syntax and rejected by the scheduler.\nSolution\nUse job linkage to manage each day-of-week requirement individually.\nSchedule the first job to run on the first desired day of the month using the valid cron expression:\n0 0 14 ? * x#y\n.\nFor quick reference:\nSunday:\n1\nMonday:\n2\nTuesday:\n3\nWednesday:\n4\nThursday:\n5\nFriday:\n6\nSaturday:\n7\nThen configure a separate job to execute the first job on the second desired day of the month." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/invalid_parameter_value-error-when-creating-a-google-vertex-ai-serving-endpoint.json b/scraped_kb_articles/invalid_parameter_value-error-when-creating-a-google-vertex-ai-serving-endpoint.json new file mode 100644 index 0000000000000000000000000000000000000000..8afc5b5a7765b28231212bfb09583f8c4ba60496 --- /dev/null +++ b/scraped_kb_articles/invalid_parameter_value-error-when-creating-a-google-vertex-ai-serving-endpoint.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/invalid_parameter_value-error-when-creating-a-google-vertex-ai-serving-endpoint", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to create a serving endpoint with the Google Vertex AI provider when you get a\nFailed to generate access token for Google VertexAI.\nerror message.\n{'error_code':'INVALID_PARAMETER_VALUE','message':'INVALID_PARAMETER_VALUE: Failed to generate access token for Google VertexAI. Please check the private key for the external model gemini-1-5-prop.'}\nCause\nThis can happen when you do not use the entire Google Vertex AI private key during the setup process on Databricks.\nThe private key is the entire JSON block that is required to connect to Google Vertex AI. The JSON block contains individual attributes called\nprivate_key\nand\nprivate_key_id\n. These attributes are only part of the private key. If you attempt to use these attributes as the private key instead of the entire JSON block, it generates an error.\nSolution\nEnsure you are using the entire private key. 
If it still does not work, generate a new private key and try again.\nFor more information on how to use the private key to configure the external model, review the Databricks\nGoogle Cloud Vertex AI configuration parameters\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nUse the entire private key\nUse the entire JSON block when configuring the connection.\nThis includes all attributes and not just the\nprivate_key\nattribute segment of the JSON structure.\nGenerate a new private key\nIf your private key does not work, you may need to generate a new one.\nFollow the steps in the Google Cloud\nCreate a service account key\ndocumentation to generate a new private key." +} \ No newline at end of file diff --git a/scraped_kb_articles/invalid_parameter_value-error-when-trying-to-access-a-table-or-view-with-fine-grained-access-control.json b/scraped_kb_articles/invalid_parameter_value-error-when-trying-to-access-a-table-or-view-with-fine-grained-access-control.json new file mode 100644 index 0000000000000000000000000000000000000000..aafdc7e73d9e7a1d1c3ffecaadc27f945800afc0 --- /dev/null +++ b/scraped_kb_articles/invalid_parameter_value-error-when-trying-to-access-a-table-or-view-with-fine-grained-access-control.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/invalid_parameter_value-error-when-trying-to-access-a-table-or-view-with-fine-grained-access-control", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen querying data in Databricks Runtime 15.3 or below on a dedicated (formerly: single user) cluster, you try to access an underlying table or view with fine-grained access control and receive an error message.\nINVALID_PARAMETER_VALUE.ROW_COLUMN_ACCESS_POLICIES_NOT_SUPPORTED_ON_ASSIGNED_CLUSTERS]\nCause\nUsing Databricks Runtime 15.3 or below on a dedicated (formerly: single user) cluster to query a table with fine-grained access control is not supported.\nFor more information, refer to the\nFilter sensitive table data using 
row filters and column masks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nUpgrade the cluster's Databricks Runtime version to 15.4 LTS or above." +} \ No newline at end of file diff --git a/scraped_kb_articles/invalid_parameter_valuelocation_overlap-overlaps-with-managed-storage-error.json b/scraped_kb_articles/invalid_parameter_valuelocation_overlap-overlaps-with-managed-storage-error.json new file mode 100644 index 0000000000000000000000000000000000000000..eafe7d4a600236f232631dbd53104356d3f90fa7 --- /dev/null +++ b/scraped_kb_articles/invalid_parameter_valuelocation_overlap-overlaps-with-managed-storage-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/invalid_parameter_valuelocation_overlap-overlaps-with-managed-storage-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using\ndbutils\nto access an\nexternal location\n(\nAWS\n|\nAzure\n|\nGCP\n) that is mounted on managed tables in a shared cluster. When you try to list the path to the location, it fails with an\nINVALID_PARAMETER_VALUE.LOCATION_OVERLAP\nerror message.\nThe error says the given path overlaps with managed storage.\ndbutils.fs.ls(\"://path/\")\r\n\r\nAnalysisException: [RequestId=96dd6185-e0dc-4fe0-94ad-bd8ab05fbd8e ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url '://path' overlaps with managed storage\nCause\nRunning the list command on a managed directory is not supported in Unity Catalog. Catalog/schema storage locations are reserved for managed storage.\nSolution\nExternal tables cannot overlap with catalog/schema storage locations, but they can be created under a subdirectory of the root location. You should not create an external table at or above the root location used for the catalog/schema.\nFor example, assume the root location is\n://\n. 
The corresponding catalog/schema location is equivalent to the managed storage location which is\n:///__unitystorage/catalogs/\n.\nYou can create an external location under\nsome-root/\nas long as it does not overlap the managed table. Given the example,\n:////\nis a valid path for an external location.\nIf you try to list the contents of this example location, the result would be successful." +} \ No newline at end of file diff --git a/scraped_kb_articles/invalidschemaexception-error-when-trying-to-insert-data-into-a-delta-table.json b/scraped_kb_articles/invalidschemaexception-error-when-trying-to-insert-data-into-a-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..1269be5ae7ae86045aba62019ec5c6af21c0a836 --- /dev/null +++ b/scraped_kb_articles/invalidschemaexception-error-when-trying-to-insert-data-into-a-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/invalidschemaexception-error-when-trying-to-insert-data-into-a-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen inserting data into a Delta table with a schema that contains a\nStructField\nof type\nNULL\n, you encounter an\nInvalidSchemaException\n.\nExample Error Message\nJob aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 22) (10.101.191.43 executor 0): org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group {}\nCause\nEmpty STRUCT fields are not permitted in Parquet format.\nThe issue arises when a StructField is defined with an empty StructType. 
In the following example, the\ncol3\nfield is defined as a STRUCT with no fields.\nfrom pyspark.sql.types import StructType, StructField, FloatType\r\nschema = StructType([\r\n StructField(\"col1\", FloatType(), nullable=True),\r\n StructField(\"col2\", FloatType(), nullable=True),\r\n StructField(\"col3\", StructType([]), nullable=True)\r\n])\nSolution\nDefine a field type for any fields that use a\nStructType\nwithin a\nStructField\n.\nExample\nschema = StructType([\r\n StructField(\"col1\", FloatType(), nullable=True),\r\n StructField(\"col2\", FloatType(), nullable=True),\r\n StructField(\"col3\", StructType([StructField(\"nested_col\",\r\nStringType())]), nullable=True)\r\n])\nFor more information, refer to the\nWhat is a view?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/ip-access-list-update-error.json b/scraped_kb_articles/ip-access-list-update-error.json new file mode 100644 index 0000000000000000000000000000000000000000..245f30cfc3763bfb1bd9b703a82c48b9b61d473b --- /dev/null +++ b/scraped_kb_articles/ip-access-list-update-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/ip-access-list-update-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to\nupdate an IP access list\nand you get an\nINVALID_STATE\nerror message.\n{\"error_code\":\"INVALID_STATE\",\"message\":\"Your current IP 3.3.3.3 will not be allowed to access the workspace under current configuration\"}\nCause\nThe IP access list update that you are trying to commit does not include your current public IP address. 
If your current IP address is not included in the access list, you are blocked from the environment.\nIf you assume that your current IP is 3.3.3.3, this example API call results in an\nINVALID_STATE\nerror message.\n%sh\r\ncurl -X POST -n \\\r\n  https:///api/2.0/ip-access-lists \\\r\n  -d '{\r\n    \"label\": \"office\",\r\n    \"list_type\": \"ALLOW\",\r\n    \"ip_addresses\": [\r\n        \"1.1.1.1\",\r\n        \"2.2.2.2/21\"\r\n      ]\r\n    }'\nSolution\nYou must always include your current public IP address in the JSON file that is used to update the IP access list.\nIf you assume that your current IP is 3.3.3.3, this example API call results in a successful IP access list update.\n%sh\r\ncurl -X POST -n \\\r\n  https:///api/2.0/ip-access-lists \\\r\n  -d '{\r\n    \"label\": \"office\",\r\n    \"list_type\": \"ALLOW\",\r\n    \"ip_addresses\": [\r\n        \"1.1.1.1\",\r\n        \"2.2.2.2/21\",\r\n        \"3.3.3.3\"\r\n      ]\r\n    }'" +} \ No newline at end of file diff --git a/scraped_kb_articles/iterate-through-all-jobs-in-the-workspace-using-jobs-api-21.json b/scraped_kb_articles/iterate-through-all-jobs-in-the-workspace-using-jobs-api-21.json new file mode 100644 index 0000000000000000000000000000000000000000..972c1b832346fc7c63a64e4a23e95ced8fed1d40 --- /dev/null +++ b/scraped_kb_articles/iterate-through-all-jobs-in-the-workspace-using-jobs-api-21.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/iterate-through-all-jobs-in-the-workspace-using-jobs-api-21", + "title": "Unknown Article Title", + "content": "In the Databricks Jobs API 2.0 (\nAWS\n|\nAzure\n|\nGCP\n),\nlist\nreturns an unbounded number of job descriptions.\nIn the Jobs API 2.1 (\nAWS\n|\nAzure\n|\nGCP\n), this behavior has changed. 
The\nlist\ncommand now returns a maximum of 25 jobs, from newest to oldest, at a time.\nIn this article we show you how to manually iterate through all of the jobs in your workspace.\nInstructions\n1) Determine the total number of jobs in your workspace\nClick\nWorkflows\nin the sidebar.\nScroll to the bottom of the page.\nThe total number of jobs in the workspace is listed in the bottom right.\n2) Determine the values to use for\noffset\nand\nlimit\nThe\nlist\ncommand has two modifiers,\nlimit\nand\noffset\n.\noffset\ndetermines the number of jobs that are skipped before the first one is displayed.\nlimit\ndetermines the number of jobs (up to 25) that are displayed. By using the commands together you can display specific jobs out of the total.\nFor example, if there are 20 total jobs in the workspace and you specify a\nlimit\nof 10 and an\noffset\nof 0,\nlist\nreturns jobs 1-10 (the 10 most recent jobs created, not the most recent job runs). Alternatively, if you specify a\nlimit\nof 10 and an\noffset\nof 10,\nlist\nreturns jobs 11-20.\nYou should consider the total number of jobs in your workspace and choose values for\nlimit\nand\noffset\nthat allow you to easily iterate through the total number of jobs.\n3) Iterate through the jobs\nYou need to iterate through the total number of jobs. For this article, we are iterating through all of the jobs in a notebook, using\ncurl\nto access the API. 
We are assuming the list of jobs is large and are displaying the maximum of 25 at a time.\nReview the Authentication using Databricks personal access tokens (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information on creating and using personal access tokens.\nAs a first call:\n%sh\r\ncurl --location --header 'Authorization: Bearer ' --request GET 'https:///api/2.1/jobs/list?limit=25'\nThe first run uses limit=25.\nFor subsequent calls, use the following syntax:\ncurl --location --header 'Authorization: Bearer '  --request GET 'https:///api/2.1/jobs/list?limit=25&page_token='\nYou can continue to iterate through the total number of jobs, displaying 25 at a time, until all of the jobs have been displayed.\n4) Use\njq\nto filter results\nInfo\njq\ncan be described as \"\nsed\nfor JSON data\". You can use it to slice, filter, map, and transform structured data.\nYou can use\njq\nto help filter for specific results. For example, if you pipe the results of your\nlist\nrequest through\njq '.deb'\n, it returns objects with a value for the key\ndeb\n.\n%sh\r\ncurl --location --header 'Authorization: Bearer '  --request GET 'https:///api/2.1/jobs/list?limit=25&offset=0' | jq '.deb'\nYou can include multiple keys when using\njq\n. 
For example,\njq '.deb, .last_updated'\nreturns jobs with values for both of the keys.\n%sh\r\ncurl --location --header 'Authorization: Bearer '  --request GET 'https:///api/2.1/jobs/list?limit=25&offset=0' | jq '.deb, .last_updated'" +} \ No newline at end of file diff --git a/scraped_kb_articles/javalangillegalargumentexception-requirement-failed-partitionsxpartition-==-y-but-it-should-equal-z.json b/scraped_kb_articles/javalangillegalargumentexception-requirement-failed-partitionsxpartition-==-y-but-it-should-equal-z.json new file mode 100644 index 0000000000000000000000000000000000000000..7a96f9499e544deb7174507f062df1323b01f7ec --- /dev/null +++ b/scraped_kb_articles/javalangillegalargumentexception-requirement-failed-partitionsxpartition-==-y-but-it-should-equal-z.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/javalangillegalargumentexception-requirement-failed-partitionsxpartition-==-y-but-it-should-equal-z", + "title": "Unknown Article Title", + "content": "Problem\nYou have a streaming job that was reading an Event Hubs source and failed with an illegal argument exception error about incorrect partitions.\njava.lang.IllegalArgumentException: requirement failed: partitions(2).partition == 3, but it should equal 2\nCause\nThe stream was reading an Event Hubs topic with 4 partitions (part-0, part-1, part-2, part-3). The error message indicates that the RDD partition at index 2 is part-3 but the expected partition is part-2.\nThe root cause was an issue in the Azure Event Hubs connector. The process caused a mismatch in the number of RDD partitions calculated between the ongoing RDD and the checkpoint metadata. When a partition offset rebalancing occurred and the sequence number was lagging and marked as expired by the Event Hubs connector, it led to an incorrect reconfiguration of the Event Hubs state.\nSolution\nYou need to upgrade your Event Hubs connector to version 2.3.19 or above. 
This version contains\na fix for the issue\n. Versions below 2.3.19 do not contain the fix.\nDatabricks recommends upgrading to Event Hubs connector version 2.3.21 or above.\nFor more information, please review the\nAzure Event Hubs connector release notes\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/javalangoutofmemoryerror-error-when-using-collect-from-sparklyr.json b/scraped_kb_articles/javalangoutofmemoryerror-error-when-using-collect-from-sparklyr.json new file mode 100644 index 0000000000000000000000000000000000000000..c5d71788c360515616d3625674911499814a2c78 --- /dev/null +++ b/scraped_kb_articles/javalangoutofmemoryerror-error-when-using-collect-from-sparklyr.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/javalangoutofmemoryerror-error-when-using-collect-from-sparklyr", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to use\ncollect()\nusing the\nsparklyr\npackage to collect large results from Apache Spark into an R session, you get a\njava.lang.OutOfMemoryError\nerror message.\nCause\nSpark has a 2GB limit on the amount of data that can be collected in an R session. When the dataset exceeds this limit, Spark generates an out of memory error message. This limit is not configurable on the Spark side so you must use an alternative solution to collect larger datasets.\nSolution\nOne way to avoid the out of memory error is by using the\narrow_collect()\nfunction from\nsparklyr\n. 
By using\narrow_collect()\nyou can collect data incrementally by using a callback function.\nYou can write a custom function,\ncollect_result()\n, that leverages this functionality to collect large datasets.\nThis custom function works by specifying a callback function that appends each batch of data to a list, which is then combined using\nrbindlist()\nfrom the\ndata.table\npackage.\nTo use this custom function, replace\ncollect()\nwith\ncollect_result()\nin your code\nresults <- tbl(sc, \"df\") %>%\r\n collect_result()\nExample collect_result() code\n%r\r\n\r\ncollect_result <- function(tbl, ...) {\r\n collected <- list()\r\n sparklyr:::arrow_collect(tbl, ..., callback = function(batch_df) {\r\n collected <<- c(collected, list(batch_df))\r\n })\r\n data.table::rbindlist(collected)\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/jdbc-optimize-read.json b/scraped_kb_articles/jdbc-optimize-read.json new file mode 100644 index 0000000000000000000000000000000000000000..77597be570f1cf944800589d1f1fb26e2933f1c8 --- /dev/null +++ b/scraped_kb_articles/jdbc-optimize-read.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/jdbc-optimize-read", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nReading data from an external JDBC database is slow. How can I improve read performance?\nSolution\nSee the detailed discussion in the Databricks documentation on how to optimize performance when reading data (\nAWS\n|\nAzure\n|\nGCP\n) from an external JDBC database." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/jdbc-write-fails-primarykeyviolation.json b/scraped_kb_articles/jdbc-write-fails-primarykeyviolation.json new file mode 100644 index 0000000000000000000000000000000000000000..4adf54e4d28688efe1c4ba136692532b773d5489 --- /dev/null +++ b/scraped_kb_articles/jdbc-write-fails-primarykeyviolation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/jdbc-write-fails-primarykeyviolation", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a\nPrimaryKeyViolation\nerror.\nAlternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables.\nCause\nWhen Apache Spark performs a JDBC write, one partition of the DataFrame is written to a SQL table. This is generally done as a single JDBC transaction, in order to avoid repeatedly inserting data. However, if the transaction fails after the commit occurs, but before the final stage completes, it is possible for duplicate data to be copied into the SQL table.\nThe\nPrimaryKeyViolation\nerror occurs when a write operation is attempting to insert a duplicate entry for the primary key.\nSolution\nYou should use a temporary table to buffer the write, and ensure there is no duplicate data.\nVerify that speculative execution is disabled in your Spark configuration:\nspark.speculation false\n. 
This is disabled by default.\nCreate a temporary table on your SQL database.\nModify your Spark code to write to the temporary table.\nAfter the Spark writes have completed, check the temporary table to ensure there is no duplicate data.\nMerge the temporary table with the target table on your SQL database.\nDelete the temporary table.\nInfo\nThis workaround should only be used if you encounter the listed data duplication issue, as there is a small performance penalty when compared to Spark jobs that write directly to the target table." +} \ No newline at end of file diff --git a/scraped_kb_articles/jdbc-write-operation-fails-with-hivesqlexception-error-the-background-threadpool-cannot-accept-new-task-for-execution.json b/scraped_kb_articles/jdbc-write-operation-fails-with-hivesqlexception-error-the-background-threadpool-cannot-accept-new-task-for-execution.json new file mode 100644 index 0000000000000000000000000000000000000000..d76f9dca1f174a790a8a46983effc9db9178ea51 --- /dev/null +++ b/scraped_kb_articles/jdbc-write-operation-fails-with-hivesqlexception-error-the-background-threadpool-cannot-accept-new-task-for-execution.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/jdbc-write-operation-fails-with-hivesqlexception-error-the-background-threadpool-cannot-accept-new-task-for-execution", + "title": "Unknown Article Title", + "content": "Problem\nWhen running the JDBC write operation using the Databricks JDBC driver, your job fails with the following error.\nCaused by: com.databricks.client.support.exceptions.ErrorException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, infoMessages:[*org.apache.hive.service.cli.HiveSQLException: The background threadpool cannot accept new task for execution, please retry the operation. ])\nCause\nThe number of concurrent asynchronous queries exceeds 100. 
By default, the driver cannot handle more than 100 Thrift/JDBC connections, leading to resource exhaustion in the background thread pool.\nSolution\nIn your cluster settings:\nScroll to\nAdvanced options\nand click to expand.\nIn the\nSpark\ntab, set the following configurations in the\nSpark config\nbox to increase the concurrent query limit.\nspark.hive.server2.async.exec.threads 200\r\nspark.hive.server2.async.exec.wait.queue.size 200\r\nspark.hive.server2.async.exec.keepalive.time 20\nThese settings increase the number of threads and queue capacity for asynchronous execution, helping prevent thread pool exhaustion." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-cluster-limit-nb-output.json b/scraped_kb_articles/job-cluster-limit-nb-output.json new file mode 100644 index 0000000000000000000000000000000000000000..5378fcc90d5ff0be9e2217cf1ac0fa9e8957e8cf --- /dev/null +++ b/scraped_kb_articles/job-cluster-limit-nb-output.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-cluster-limit-nb-output", + "title": "Unknown Article Title", + "content": "Problem\nYou are running a notebook on a job cluster and you get an error message indicating that the output is too large.\nThe output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes\nCause\nThis error message can occur in a job cluster whenever the notebook output is greater than 20 MB.\nIf you are using multiple\ndisplay()\n,\ndisplayHTML()\n,\nshow()\ncommands in your notebook, this increases the amount of output. Once the output exceeds 20 MB, the error occurs.\nIf you are using multiple\nprint()\ncommands in your notebook, this can increase the output to\nstdout\n. Once the output exceeds 20 MB, the error occurs.\nIf you are running a streaming job and enable\nawaitAnyTermination\nin the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n), it tries to fetch the entire output in a single request. 
If this exceeds 20 MB, the error occurs.\nSolution\nRemove any unnecessary\ndisplay()\n,\ndisplayHTML()\n,\nprint()\n, and\nshow()\ncommands in your notebook. These can be useful for debugging, but they are not recommended for production jobs.\nIf your job output exceeds the 20 MB limit, try redirecting your logs to\nlog4j\nor disable\nstdout\nby setting\nspark.databricks.driver.disableScalaOutput true\nin the cluster’s\nSpark config\n.\nFor more information, please review the documentation on output size limits (\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-execution-returning-udf_max_count_exceeded-error.json b/scraped_kb_articles/job-execution-returning-udf_max_count_exceeded-error.json new file mode 100644 index 0000000000000000000000000000000000000000..028d2a4d0d39f9256d8be7d2b939081d57559dbe --- /dev/null +++ b/scraped_kb_articles/job-execution-returning-udf_max_count_exceeded-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-execution-returning-udf_max_count_exceeded-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen executing a job that uses more than five user-defined functions (UDFs) in Databricks Runtime versions below 14.3 LTS, you encounter the following error, where X is greater than five.\nSparkRuntimeException: [UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs (limited during public preview). Found X. 
The UDFs were: .\nCause\nDatabricks Runtime versions below 14.3 LTS enforce a maximum of five UDFs per query plan on Unity Catalog Python and PySpark UDFs, to mitigate the risk of out-of-memory (OOM) errors, especially on shared clusters.\nSolution\nIf your workload requires more than five UDFs, you can increase the UDF limit by adjusting the cluster configuration, but be aware that increasing the limit may increase the risk of OOM errors.\nDatabricks also recommends adjusting memory tracking settings to optimize resource usage.\n1. Navigate to your target cluster and click\nEdit\n.\n2. Scroll to\nAdvanced options\nand click to expand.\n3. Under the\nSpark\ntab, in the\nSpark config\nfield, add the following Apache Spark configurations.\nspark.databricks.safespark.externalUDF.plan.limit 20\r\nspark.databricks.safespark.sandbox.trackMemory.enabled false\r\nspark.databricks.safespark.sandbox.size.default.mib 500\n4. Restart the cluster for the changes to take effect." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-executions-failing-on-clusters-using-docker-container-services-with-malformedinputexception-error.json b/scraped_kb_articles/job-executions-failing-on-clusters-using-docker-container-services-with-malformedinputexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..e4fdb5e0d1e7b7a2bfd0511f258b9e924161cc9c --- /dev/null +++ b/scraped_kb_articles/job-executions-failing-on-clusters-using-docker-container-services-with-malformedinputexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/job-executions-failing-on-clusters-using-docker-container-services-with-malformedinputexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile running jobs on clusters using Docker services, you get an error message.\njava.nio.charset.MalformedInputException: Input length = 1.\nCause\nThe\njava.nio.charset\npart of the error indicates character encoding 
issues. The Java runtime is not able to properly decode the input data with a given character set or the input is improperly formatted. For example, if the input data is in UTF-8 encoding but the runtime environment is expecting a different encoding like ASCII or Latin-1, this mismatch can cause the decoding to fail.\nThe Java runtime’s inability to decode the input data can also occur due to changes in the incoming data format that are not suitable for the existing environment configurations, or the Java runtime’s LANG settings get reset with chauffeur restarts (which, in turn, restart the driver).\nSolution\nSpecify the correct character encoding when reading the file.\nExample with PySpark\n```python\r\ndf = spark.read.option(\"charset\", \"\").csv(\"path/to/your/file.csv\")\r\n```\nExample with Scala\n```scala\r\nval df = spark.read.option(\"charset\", \"\").csv(\"path/to/your/file.csv\")\r\n```\nAdditionally, set the following environment variables section in your cluster configuration page to ensure consistent character encoding.\nLANG=C.UTF-8\r\nLC_ALL=C.UTF-8\nSetting\nLANG=C.UTF-8\nand\nLC_ALL=C.UTF-8\nconfigures the default setting of the cluster using Docker services to use UTF-8 encoding, helping resolve character encoding and malformed input issues in Java processes.\nPreventative measures\nAlways specify the character encoding when reading files, especially if you are unsure about the format of the incoming data.\nRegularly review and update your cluster configurations to ensure they are compatible with the data being processed.\nMonitor your Databricks environment for any changes in the incoming data format and adjust your configurations accordingly." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-failing-when-trying-to-write-dataset-to-external-path.json b/scraped_kb_articles/job-failing-when-trying-to-write-dataset-to-external-path.json new file mode 100644 index 0000000000000000000000000000000000000000..2dc5ed4281f57eb003734798d57ff3a925271c2d --- /dev/null +++ b/scraped_kb_articles/job-failing-when-trying-to-write-dataset-to-external-path.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/job-failing-when-trying-to-write-dataset-to-external-path", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to write a dataset with an external path, your job fails.\nQuery\nCREATE TABLE IF NOT EXISTS .. USING DELTA LOCATION '{location}';\nError\nUnityCatalogServiceException: [RequestId=934c88aa-3f54-4ed3-ba30-830265560ae6 ErrorClass=INVALID_PARAMETER_VALUE.INVALID_PARAMETER_VALUE] Input path s3://path/preview overlaps with other external tables or volumes. Conflicting tables/volumes:\nCause\nTwo tables cannot be created in the same location, even across different catalogs. Either there is already an external table defined on the path, or the input path overlaps with other external tables or volumes.\nSolution\nClarify which tables or volumes are using which location on a selected schema. 
Identify the location of each table and ensure that you are not using the same path for multiple tables.\nfrom pyspark.sql.functions import col\r\ndf = spark.sql(\"SHOW tables\").collect()\r\nfor row in df:\r\n    table = row[\"tableName\"]\r\n    display(table)\r\n    table_detail = spark.sql(\"describe table extended {}\".format(table))\r\n    table_detail = table_detail.filter(col(\"col_name\") == \"Location\")\r\n    display(table_detail)\nOptionally, to get a volume path, use the following command.\nDESCRIBE volume ;\nThen use the\nSHALLOW CLONE\ncommand to create a new table with the same schema and data as an existing table, but with a different location.\nFor more information on managing external tables, review the\nExternal tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor more information on using\nSHALLOW CLONE\n, refer to the\nShallow clone for Unity Catalog tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-failing-with-delta_change_data_file_not_found-error.json b/scraped_kb_articles/job-failing-with-delta_change_data_file_not_found-error.json new file mode 100644 index 0000000000000000000000000000000000000000..18057354a4e76567d0c3c939dc58e06ac1d26581 --- /dev/null +++ b/scraped_kb_articles/job-failing-with-delta_change_data_file_not_found-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/job-failing-with-delta_change_data_file_not_found-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen working with Delta Lake and change data feed (CDF) features, you encounter a file not found error message.\npyspark.errors.exceptions.connect.SparkException: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file .snappy.parquet. 
[DELTA_CHANGE_DATA_FILE_NOT_FOUND] File .snappy.parquet referenced in the transaction log cannot be found.\nCause\nThe error message typically includes the following explanation.\nThis can occur when data has been manually deleted from the file system rather than using the table `DELETE` statement. This request appears to be targeting Change Data Feed, if that is the case, this error can occur when the change data file is out of the retention period and has been deleted by the `VACUUM` statement.\nSolution\nStart your stream with a new checkpoint location. Setting a new checkpoint location allows you to resume streaming from the beginning and avoid getting stuck on missing files. Ensure you make a backup of your older checkpoint location first.\nIf you don't want to use a new checkpoint location, add the following configuration to your cluster.\nNavigate to your cluster and click on it to open the settings.\nScroll down to\nAdvanced options\nand click to expand.\nUnder the\nSpark\ntab, in the\nSpark config\nbox, enter the following code.\nspark.sql.files.ignoreMissingFiles true\nImportant\nDatabricks recommends that once the stream has recovered, you remove the\nspark.sql.files.ignoreMissingFiles\nflag to ensure that any new missing file errors are not skipped without notification.\nFor more information on\n.ignoremissingfiles\n, refer to the Apache Spark\nGeneric File Source Options\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-adls-hour.json b/scraped_kb_articles/job-fails-adls-hour.json new file mode 100644 index 0000000000000000000000000000000000000000..5c64ca76d851d2526915f738cdbe21226d152dc4 --- /dev/null +++ b/scraped_kb_articles/job-fails-adls-hour.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/job-fails-adls-hour", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using Azure Active Directory (Azure AD) credential passthrough to access Azure Data Lake Storage (ADLS) resources.\nJobs that run longer than one hour fail with a\nHTTP401\nerror message.\ncom.microsoft.azure.datalake.store.ADLException: Error reading from file /local/Users/\r\n\r\nOperation OPEN failed with HTTP401 : null\r\nLast encountered exception thrown after 5 tries. [HTTP401(null),HTTP401(null),HTTP401(null),HTTP401(null),HTTP401(null)]\nCause\nThe lifetime of an Azure AD passthrough token is one hour. When a command is sent to the cluster that takes longer than one hour, it fails if an ADLS resource is accessed after the one hour mark.\nThis is a known issue.\nSolution\nYou must rewrite your queries, so that no single command takes longer than an hour to complete.\nIt is not possible to increase the lifetime of an Azure AD passthrough token. The token is retrieved by the Azure Databricks replicated principal. You cannot edit its properties.\nPlease review the\nADLS credential passthrough limitations\ndocumentation for more information." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-invalid-access-token.json b/scraped_kb_articles/job-fails-invalid-access-token.json new file mode 100644 index 0000000000000000000000000000000000000000..055ead6776759e95e8b7c499aac1d0da4f42e71f --- /dev/null +++ b/scraped_kb_articles/job-fails-invalid-access-token.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-invalid-access-token", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nLong running jobs, such as streaming jobs, fail after 48 hours when using\ndbutils.secrets.get()\n(\nAWS\n|\nAzure\n|\nGCP\n).\nFor example:\n%python\r\n\r\nstreamingInputDF1 = (\r\n     spark\r\n    .readStream                       \r\n    .format(\"delta\")               \r\n    .table(\"default.delta_sorce\")\r\n  )\r\n\r\n\r\ndef writeIntodelta(batchDF, batchId):\r\n  table_name = dbutils.secrets.get(\"secret1\",\"table_name\")\r\n  batchDF = batchDF.drop_duplicates()\r\n  batchDF.write.format(\"delta\").mode(\"append\").saveAsTable(table_name)\r\n\r\n\r\nstreamingInputDF1 \\\r\n  .writeStream \\\r\n  .format(\"delta\") \\\r\n  .option(\"checkpointLocation\", \"dbfs:/tmp/delta_to_delta\") \\\r\n  .foreachBatch(writeIntodelta) \\\r\n  .outputMode(\"append\") \\\r\n  .start()\nThis example code returns an error after 48 hours.\n\r\n\r\nError 403 Invalid access token.\r\n\r\n

HTTP ERROR 403\r\nProblem accessing /api/2.0/secrets/get. Reason:\r\nInvalid access token.

\r\n\nCause\nDatabricks Utilities (dbutils) (\nAWS\n|\nAzure\n|\nGCP\n)  tokens expire after 48 hours.\nThis is by design.\nSolution\nYou cannot extend the life of a token.\nJobs that take more than 48 hours to complete should not use\ndbutils.secrets.get()\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-maxresultsize-exception.json b/scraped_kb_articles/job-fails-maxresultsize-exception.json new file mode 100644 index 0000000000000000000000000000000000000000..83666649cef61d88b112be71b1ce7d2e10e1b206 --- /dev/null +++ b/scraped_kb_articles/job-fails-maxresultsize-exception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-maxresultsize-exception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAn Apache Spark job fails with a\nmaxResultSize\nexception.\norg.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized\r\nresults of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)\nCause\nThis error occurs because the configured size limit was exceeded. The size limit applies to the total serialized results for Spark actions across all partitions. The Spark actions include actions such as\ncollect()\nto the driver node,\ntoPandas()\n, or saving a large file to the driver local file system.\nSolution\nFirst, refactor your code to prevent the driver node from collecting a large amount of data. Change the code so that the driver node collects a limited amount of data or increase the driver instance memory size.\nIf the workload requires you to collect a large amount of data and it cannot be updated, you can update the property\nspark.driver.maxResultSize\nto a value\n\nGB higher than the value reported in the exception message in the cluster Spark config (\nAWS\n|\nAzure\n|\nGCP\n).\nDelete\nImportant\nSetting\nspark.driver.maxResultSize\nto very high values can lead to OOM errors. 
Use this config only if necessary.\nIf you are using BI tools (such as PowerBI or Tableau, or your own custom application using a JDBC or ODBC driver) and are facing this issue, enable Cloud Fetch in your application.\nCloud Fetch allows the executors to upload the result sets to the workspace root cloud storage (Databricks File System), and then the driver node generates pre-signed URLs for the result set and sends it to the JDBC or ODBC driver.\nThis workflow allows results to be directly downloaded from cloud storage, avoiding the driver having to use up its memory to collect the results.\nFor more information on Cloud Fetch, refer to the\nDriver capability settings for the Databricks ODBC Driver\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-no-library.json b/scraped_kb_articles/job-fails-no-library.json new file mode 100644 index 0000000000000000000000000000000000000000..c566a3e8ad186b5f8ea7c6ab8b7512df66ee3b32 --- /dev/null +++ b/scraped_kb_articles/job-fails-no-library.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-no-library", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA Databricks job fails because the job requires a library that is not yet installed, causing\nImport\nerrors.\nCause\nThe error occurs because the job starts running before required libraries finish installing. If you run a job on a cluster in either of the following situations, the cluster can experience a delay in installing libraries:\nWhen you start an existing cluster with libraries in a terminated state.\nWhen you start a new cluster that uses a shared library (a library installed on all clusters).\nSolution\nIf a job requires certain libraries, make sure to attach the libraries as dependent libraries within the job itself. 
Refer to the following article and steps on how to set up dependent libraries when you create a job.\nAdd libraries as dependent libraries when you create a job (\nAWS\n|\nAzure\n).\n1. Open\nAdd Dependent Library\ndialog:\nAWS\nAzure\n2. Choose library:\nAWS\nAzure\n3. Verify library:\nAWS\nAzure" +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-spark-finishes.json b/scraped_kb_articles/job-fails-spark-finishes.json new file mode 100644 index 0000000000000000000000000000000000000000..f89f449e4e95bd6e40be9b91b30b308bdb3aa49b --- /dev/null +++ b/scraped_kb_articles/job-fails-spark-finishes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-spark-finishes", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed.\nCause\nYou have explicitly called\nspark.stop()\nor\nSystem.exit(0)\nin your code.\nIf either of these are called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen.\nSolution\nDo not call\nspark.stop()\nor\nSystem.exit(0)\nin Spark code that is running on a Databricks cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-throttled-atypical.json b/scraped_kb_articles/job-fails-throttled-atypical.json new file mode 100644 index 0000000000000000000000000000000000000000..0693aec09d7f392b76bad1c881a2c6e5c0fe9d37 --- /dev/null +++ b/scraped_kb_articles/job-fails-throttled-atypical.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-throttled-atypical", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour job run fails with a\nthrottled due to observing atypical errors\nerror message.\nCluster became unreachable during run Cause: xxx-xxxxxx-xxxxxxx is throttled due to observing atypical errors\nCause\nThe jobs on this cluster have returned too many large results to the Apache Spark driver node.\nAs a result, the chauffeur service runs out of memory, and the cluster becomes unreachable.\nThis can happen after calling the\n.collect\nor\n.show\nAPI.\nSolution\nYou can either reduce the workload on the cluster or increase the value of\nspark.memory.chauffeur.size\n.\nThe chauffeur service runs on the same host as the Spark driver. When you allocate more memory to the chauffeur service, less overall memory will be available for the Spark driver.\nSet the value of\nspark.memory.chauffeur.size\n:\nOpen the cluster configuration page in your workspace.\nClick\nEdit\n.\nExpand\nAdvanced Options\n.\nEnter the value of\nspark.memory.chauffeur.size\nin mb in the\nSpark config\nfield.\nClick\nConfirm and Restart\n.\nDelete\nInfo\nThe default value for\nspark.memory.chauffeur.size\nis 1024 megabytes. This is written as\nspark.memory.chauffeur.size 1024mb\nin the Spark configuration. The maximum value is the lesser of 16 GB or 20% of the driver node’s total memory." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-while-installing-odbc-driver-18-for-sql-server-using-an-init-script.json b/scraped_kb_articles/job-fails-while-installing-odbc-driver-18-for-sql-server-using-an-init-script.json new file mode 100644 index 0000000000000000000000000000000000000000..2505300da014fefa314baf87c7cba630df6caf6e --- /dev/null +++ b/scraped_kb_articles/job-fails-while-installing-odbc-driver-18-for-sql-server-using-an-init-script.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/job-fails-while-installing-odbc-driver-18-for-sql-server-using-an-init-script", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen installing ODBC Driver 18 for SQL Server in Databricks compute using an init script, your job intermittently fails with the following error.\nTask 4 in stage 2.0 failed 4 times, most recent failure: Lost task 4.3 in stage 2.0 (TID 31) (10.148.45.37 executor 1): SenzingEngineException{errorCode=-2, input='', senzingError='1000E|Unhandled Database Error '(0:01000[unixODBC][Driver Manager]Can't open lib 'ODBC Driver 18 for SQL Server' : file not found)''}\nCause\nApache Spark executors and the driver manager (unixODBC) can’t find the shared library file for\nmsodbcsql18\nbecause the library is not added to\nLD_LIBRARY_PATH\n.\nWhen\nLD_LIBRARY_PATH\n, an environment variable the dynamic linker in Linux uses to locate shared libraries, is not set correctly the system can’t find the necessary libraries to load.\nSolution\nAdd\nLD_LIBRARY_PATH\nvariable to your init script using the following code.\nThe first line adds\nmsodbcsql18\nto the\nLD_LIBRARY_PATH\nfor the current session, to help any process started after this point (including your Spark executors) locate the ODBC driver.\nThe second line appends\nLD_LIBRARY_PATH\npath to\n/etc/environment\n. Appending ensures:\nthe updated\nLD_LIBRARY_PATH\nis applied system-wide. 
Any future processes, even if spawned by different users or by different lifecycle events (for example, executor restarts), will inherit the path.\nall executors and drivers on the cluster have access to the correct library path, even if they are restarted or scaled dynamically.\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/microsoft/msodbcsql18/lib64\r\n\r\necho 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/microsoft/msodbcsql18/lib64' >> /etc/environment" +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-executorlostfailure-because-executor-is-busy.json b/scraped_kb_articles/job-fails-with-executorlostfailure-because-executor-is-busy.json new file mode 100644 index 0000000000000000000000000000000000000000..6a0dc114f644cf2cd78eaba65c1338e549d85847 --- /dev/null +++ b/scraped_kb_articles/job-fails-with-executorlostfailure-because-executor-is-busy.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/job-fails-with-executorlostfailure-because-executor-is-busy", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nJob fails with an\nExecutorLostFailure\nerror message.\nExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms\nCause\nThe\nExecutorLostFailure\nerror message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message which can have more than one root cause. In this article, we will look how to resolve issues when the root cause is due to the executor being busy.\nThis can happen if the load is too high and the executor is not able to send a heartbeat signal to the driver within a predefined threshold time. 
If this happens, the driver considers the executor lost.\nHow do you determine if a high CPU load is the reason for the executor getting lost?\nTo confirm that the executors are busy, you should check the Ganglia metrics and review the CPU loads.\nIf the CPU loads are high:\nyou have too many small files\nyour job may be launching too many API calls/requests\nyou don't have an optimum partition strategy\nyou don't have a compute-intensive cluster\nSolution\nHere is what you need to do based on different causes of failure.\nAre there too many small files?\nCompact small files to bigger files. Delta Lake supports\nOPTIMIZE\n(\nAWS\n|\nAzure\n|\nGCP\n) which is used to compact files. Auto optimize on Databricks (\nAWS\n|\nAzure\n|\nGCP\n) automatically compacts small files during writes.\nIf you are not using Delta Lake, you should plan to create larger files before writing data to your tables. This can be achieved by applying\nrepartition()\nbefore writing files to the table location.\nAre there too many API requests?\nAttempt to minimize the number of API calls made by your job. This applies to both external API services and any Databricks REST API calls. One way to do this is by repartitioning the source data so it contains a smaller number of partitions and then making the necessary API calls.\nIt is also good to determine why you have a high number of API requests. For example, if the root cause is too many small files, you should compact as many small files as possible into large files. A few large files are recommended over many small files.\nAre there too many partitions?\nOne common mistake is over-partitioning of data sources. Instances with multiple thousands of partitions are not optimal. Ideally, you should have a small number of partitions that can be processed in parallel by the available cores in your cluster.\nEnsure that your partitions have as few levels as possible. 
For example, instead of using a partition structure like\nyr=/month=/day=/\n, you could reduce the date to a single level. For example,\ndate=yyyy-MM-dd\n.\nAvoid partitioning data based on a column that has high cardinality, like ID columns. Instead, pick a column that is commonly used in your queries, but has a lower cardinality.\nAre the CPU loads in Ganglia metrics high?\nIf you have high CPU loads, if the cluster utilization is more than 80%, or if you notice certain nodes in the Ganglia metrics are red, it means you are not using the right type of cluster for the workload.\nMake sure your selected cluster has enough CPU cores and available memory. You can also try using a\nCompute optimized\nworker type instead of the default\nStorage optimized\nworker type for your cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-executorlostfailure-due-to-out-of-memory-error.json b/scraped_kb_articles/job-fails-with-executorlostfailure-due-to-out-of-memory-error.json new file mode 100644 index 0000000000000000000000000000000000000000..6798f1ba679cc4dac6a85985ba11a37bb642d91f --- /dev/null +++ b/scraped_kb_articles/job-fails-with-executorlostfailure-due-to-out-of-memory-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/job-fails-with-executorlostfailure-due-to-out-of-memory-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nJob fails with an\nExecutorLostFailure\nerror message.\nExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms\nCause\nThe\nExecutorLostFailure\nerror message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message which can have more than one root cause. 
In this article, we will look at how to resolve issues when the root cause is the executor running out of memory.\nThis issue can occur when an executor has more data to process than its available memory can handle. For example, if an executor in your cluster has 24GB of memory and the cumulative data size of all the tasks executing on that executor is greater than 24GB, the executor can run out of memory.\nHow do you determine if OOM is the reason for the executor getting lost?\nOpen the Spark UI.\nClick\nStages\n.\nClick\nFailed stages\n.\nClick the\ndescription\nthat corresponds to the failed stage.\nReview the bottom of the\nstage details\npage.\nSort the list of tasks on the\nerror\ncolumn.\nThe error messages describe why a specific task failed. If you see an error message that says\nout of memory\n, or a similar error like\njava.lang.OutOfMemoryError\n, it means the task failed because the executor ran out of memory.\nSolution\nWhen an executor is failing due to running out of memory, you should review the following items.\nIs there a data skew?\nCheck whether the data is equally distributed across executors, or if there is any skew in the data.\nYou can find this by checking the\nstage summary\ntable on the\nstage details\npage of the Spark UI.\nIf there is data skew and this is the only executor that has more data in it, you need to resolve the skew to prevent the executor from running out of memory.\nIn most cases Adaptive Query Execution (AQE) automatically detects data skew and resolves the issue. However, there are some edge cases where AQE may not detect data skew correctly. Please review Why didn’t AQE detect my data skew? 
(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nIf you are having trouble resolving data skew, you can try increasing the number of partitions or by explicitly mentioning the skew hints as explained in the\nHow to specify skew hints in dataset and DataFrame-based join commands\narticle.\nA partition is considered skewed when both\n(partition size > skewedPartitionFactor * median partition size)\nand\n(partition size > skewedPartitionThresholdInBytes)\nare true.\nFor example, given a median partition size of 200 MB, if any partition exceeds 1 GB (200 MB * 5 (five is the default\nskewedPartitionFactor\nvalue)), it is considered skewed. Under this example, if you have a partition size of 900 MB it wouldn't be considered as skewed with the default settings.\nNow say your application code does a lot of transformations on the data (like\nexplode\n, cartesian join, etc.). If you are performing a high number of transformations, you can overwhelm the executor, even if the partition isn't normally considered skewed.\nUsing our example defaults, you may find that a 900 MB partition is too much to successfully process. If that is the case, you should reduce the\nskewedPartitionFactor\nvalue. 
By reducing this value to 4, the system then considers any partition over 800 MB as skewed and automatically assigns the appropriate skew hints.\nPlease review the AQE documentation on dynamically handling skew join (\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nIs the executor capable enough?\nIf data is equally distributed across all executors and you still see out of memory errors, the executor does not have enough resources to handle the load you are trying to run.\nIncrease horizontally by increasing the number of workers and/or increase vertically by selecting a\nWorker type\nwith more memory when creating your clusters.\nIs it a properly configured streaming job?\nIf there is no apparent data skew, but the executor is still getting too much data to process, you should use\nmaxFilesPerTrigger\nand/or the trigger frequency settings to reduce the amount of data that is processed at any one time.\nReducing the load on the executors also helps reduce the memory requirement, at the expense of slightly higher latency. In exchange for the increase in latency, the streaming job processed streaming events in a more controlled manner. A steady flow of events is reliably processed with every micro batch.\nPlease review the\nOptimize streaming transactions with .trigger\narticle for more information. You should also review the Spark Structured Streaming Programming Guide documentation on\ninput sources\nand\ntriggers\n.\nIf you want to increase the speed of the processing, you need to increase the number of executors in your cluster. You can also repartition the input streaming DataFrame, so the number of tasks is less than or equal to the number of cores in the cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-executorlostfailure-error-due-to-excessive-garbage-collection-gc.json b/scraped_kb_articles/job-fails-with-executorlostfailure-error-due-to-excessive-garbage-collection-gc.json new file mode 100644 index 0000000000000000000000000000000000000000..b4891b7cb85c27a712632b29b6931626b856d9f0 --- /dev/null +++ b/scraped_kb_articles/job-fails-with-executorlostfailure-error-due-to-excessive-garbage-collection-gc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/job-fails-with-executorlostfailure-error-due-to-excessive-garbage-collection-gc", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to use broadcasting, you experience an\nExecutorLostFailure\nerror.\nExecutorLostFailure (executor <> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <> ms\nFor instructions on addressing this\nExecutorLostFailure\nerror message due to other OOM issues, refer to\nJob fails with ExecutorLostFailure due to “Out of memory” error\n.\nCause\nThis article covers how to address\nExecutorLostFailure\nissues caused by excessive garbage collection on the executor and the use of broadcasting.\nBroadcasting an 8GB table means that the table is sent to every executor. If the executor's memory is insufficient to handle both the broadcast data and other tasks, it can lead to excessive garbage collection (GC) or even an Out of Memory error (OOM), causing the executor to crash. This memory overload can result in heartbeat timeouts, as the executor spends most of its time trying to manage memory and fails to respond to the driver's heartbeat checks.\nSolution\nAvoid data shuffling by broadcasting one of the two tables or DataFrames (the smaller one) that are being joined together. 
The table is broadcast by the driver, which copies it to all worker nodes.\nWhen executing joins, modify the\nautoBroadcastJoinThreshold\nto a value lower than 8GB, preferably 1GB, based on your cluster configuration. Alternatively, remove the parameter and let the adaptive query execution (AQE) decide the best optimizer strategy for your job for better performance.\nspark.databricks.adaptive.autoBroadcastJoinThreshold  \r\nspark.sql.autoBroadcastJoinThreshold  " +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-indexoutofboundsexception-and-arrowbuf-errors.json b/scraped_kb_articles/job-fails-with-indexoutofboundsexception-and-arrowbuf-errors.json new file mode 100644 index 0000000000000000000000000000000000000000..c615d85fefc36ddcd3329300ecc59a0a26731a3e --- /dev/null +++ b/scraped_kb_articles/job-fails-with-indexoutofboundsexception-and-arrowbuf-errors.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-fails-with-indexoutofboundsexception-and-arrowbuf-errors", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are getting intermittent job failures with\njava.lang.IndexOutOfBoundsException\nand\nArrowBuf\nerrors.\nExample stack trace\nPy4JJavaError: An error occurred while calling o617.count.\r\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))\r\n    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)\r\n    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)\r\n    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)\r\n    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)\r\n    at 
org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)\r\n    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)\r\n    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)\r\n    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)\r\n    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)\r\n    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\r\n    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)\r\n    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)\r\n    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)\r\n    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)\r\n    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)\r\n\r\n\r\nDriver stacktrace:\r\n    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)\r\n    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)\r\n    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)\r\n    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\r\n    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\r\n    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\r\n    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)\r\n    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)\r\n    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)\r\n    at 
scala.Option.foreach(Option.scala:407)\r\n    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)\r\n    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)\r\n    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)\r\n    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)\r\n    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\r\nCaused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))\r\n    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)\r\n    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)\r\n    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)\r\n    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)\r\n    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)\r\n    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)\r\n    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)\r\n    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)\r\n    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)\r\n    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\r\n    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)\r\n    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)\r\n    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)\r\n    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)\r\n    at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)\nCause\nThis is due to an issue in the Apache Arrow buffer size estimation. Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer data between the JVM and Python.\nWhen\ngroupBy\nis used with\napplyInPandas\n, it can result in this error.\nFor more information, please review the\nARROW-15983\nissue on the Apache site.\nSolution\nThis is a sporadic failure and a retry usually succeeds.\nIf a retry doesn't work, you can work around the issue by adding the following to your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n):\nspark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true\nInfo\nEnabling\npandasZeroConfConversion.groupbyApply\nmay result in lower performance, so it should only be used if needed. This should not be a default setting on your cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-inputoutput-error-when-displaying-a-data-frame-.json b/scraped_kb_articles/job-fails-with-inputoutput-error-when-displaying-a-data-frame-.json new file mode 100644 index 0000000000000000000000000000000000000000..469ce4e601cc39a06c41b29e82a2d222f8ba7134 --- /dev/null +++ b/scraped_kb_articles/job-fails-with-inputoutput-error-when-displaying-a-data-frame-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/job-fails-with-inputoutput-error-when-displaying-a-data-frame-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile working in a Data Engineering environment using Apache Spark SQL and Delta Lake, your job fails when attempting to display a data frame using the\ndisplay()\ncommand.\nOSError: [Errno 5] Input/output error: '/Workspace/Repos/dir1/test'.\nCause\nNetwork security group (NSG) configurations restrict certain ports required for internal communications within Databricks workspaces. 
Specifically, port 1017 is not allowed on the NSG for a workspace, leading to input/output errors when the\ndisplay()\ncommand attempts to access the filesystem.\nSolution\nEnsure that all necessary ports for internal communications are allowed in your NSG for your Databricks workspace.\nSpecifically verify that port 1017 is open for both UDP and TCP traffic.\nRefer to the\nConfigure a customer-managed VPC\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation section on security groups\nto configure your NSG to your needs.\nAfter updating the NSG settings, restart the affected job and verify that the issue is resolved.\nIf your job is still failing, review the backend logs for any additional errors or timeouts and adjust the NSG settings accordingly.\nNote\nFor future prevention, ensure that the NSG configurations are periodically reviewed and updated to accommodate any changes in the Databricks environment or network requirements. Additionally, consider setting up monitoring and alerts for network-related issues to proactively address potential disruptions." 
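A quick way to sanity-check the NSG change from a notebook is a plain TCP probe. This is not from the article: the helper and the example address below are illustrative, and a UDP check would need a separate protocol-aware test.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Illustrative usage (hypothetical worker address):
# can_connect("10.0.0.5", 1017)
```

If the probe fails after the NSG update, the rule change has likely not propagated or is scoped to the wrong subnet.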
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-java-indexoutofboundsexception-error.json b/scraped_kb_articles/job-fails-with-java-indexoutofboundsexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..f62cc4eca4ad9a80c5f13455246b880dbe611faf --- /dev/null +++ b/scraped_kb_articles/job-fails-with-java-indexoutofboundsexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-fails-with-java-indexoutofboundsexception-error", + "title": "Unknown Article Title", + "content": "Problem\nYour job fails with a Java\nIndexOutOfBoundsException\nerror message:\njava.lang.IndexOutOfBoundsException: index: 0, length: (expected: range(0, 0))\nWhen you review the stack trace you see something similar to this:\nPy4JJavaError: An error occurred while calling o617.count.\r\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))\nat io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)\nat io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)\nat org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)\nat org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)\nat org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)\nat org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)\nat org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)\nat org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)\nat org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)\nat 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\nat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)\nat org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)\nat org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)\nat org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)\nat org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)\r\n\r\n\r\nDriver stacktrace:\nat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)\nat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)\nat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)\nat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\nat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\nat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\nat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)\nat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)\nat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)\nat scala.Option.foreach(Option.scala:407)\nat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)\nat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)\nat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)\nat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)\nat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\r\nCaused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))\nat 
io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)\nat io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)\nat org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)\nat org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)\nat org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)\nat org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)\nat org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)\nat org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)\nat org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)\nat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\nat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)\nat org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)\nat org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)\nat org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)\nat org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)\nCause\nThis error occurs due to an\narrow buffer limitation\n. When\ngroupby()\nis used along with\napplyInPandas\nit results in this error.\nSolution\nYou can work around the issue by setting the following value in your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n):\nspark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true\nThis setting allows\ngroupby()\nto function correctly with pandas operations." 
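The failing pattern and the workaround from the two Arrow-related articles above can be sketched as follows. The grouped function and column names are illustrative; only the config key comes from the articles, and the Spark-side lines are commented because they need a live cluster.

```python
import pandas as pd

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Grouped-map function of the kind passed to applyInPandas; wide string
    columns flowing through this path are what can hit the Arrow buffer limit."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# On a cluster (sketch -- requires a SparkSession named `spark`):
# spark.conf.set(
#     "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
#     "true")  # workaround from the article; prefer setting it in the cluster Spark config
# df = spark.createDataFrame([(1, 1.0), (1, 3.0), (2, 5.0)], ["id", "v"])
# out = df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double")
```

The pandas function itself is ordinary code; the error originates in the Arrow serialization layer underneath, which is why the fix is a config change rather than a code change.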
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-nosuchelementexception-error.json b/scraped_kb_articles/job-fails-with-nosuchelementexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..22a19b9c8f6d23f2859d4f4006ed8c48a6ce8e6b --- /dev/null +++ b/scraped_kb_articles/job-fails-with-nosuchelementexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-fails-with-nosuchelementexception-error", + "title": "Unknown Article Title", + "content": "Problem\nYou are getting intermittent job failures with a\nNoSuchElementException\nerror.\nExample stack trace\nPy4JJavaError: An error occurred while calling o2843.count.\r\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 868.0 failed 4 times, most recent failure: Lost task 17.3 in stage 868.0 (TID 3065) (10.249.38.86 executor 6): java.util.NoSuchElementException\r\nat org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:69)\r\nat org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:58)\r\nat scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)\r\nat org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:401)\r\nat org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:382)\r\nat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)\r\n...\nCause\nThe\nNoSuchElementException\nerror is the result of an issue in Apache Arrow optimization. Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer data between the JVM and Python.\nWhen Arrow optimization is enabled in the Py4J interface, there is a possibility it can call\niterator.next()\nwithout checking\niterator.hasNext()\n. 
This can result in a\nNoSuchElementException\nerror.\nSolution\nSet\nspark.databricks.pyspark.emptyArrowBatchCheck\nto\ntrue\nin the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nspark.databricks.pyspark.emptyArrowBatchCheck=true\nEnabling\nspark.databricks.pyspark.emptyArrowBatchCheck\nprevents a\nNoSuchElementException\nerror from occurring when the Arrow batch size is 0.\nAlternatively, you can disable Arrow optimization by setting the following properties in your cluster's\nSpark config\n.\nspark.sql.execution.arrow.pyspark.enabled=false\r\nspark.sql.execution.arrow.enabled=false\nDisabling Arrow optimization may have performance implications." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-not-enough-memory-to-build-the-hash-map-error.json b/scraped_kb_articles/job-fails-with-not-enough-memory-to-build-the-hash-map-error.json new file mode 100644 index 0000000000000000000000000000000000000000..3dd42e65aacc45b848666e34d8a6831f42c79101 --- /dev/null +++ b/scraped_kb_articles/job-fails-with-not-enough-memory-to-build-the-hash-map-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-fails-with-not-enough-memory-to-build-the-hash-map-error", + "title": "Unknown Article Title", + "content": "Info\nThis article applies to Databricks Runtime 11.3 LTS and above.\nProblem\nYou are running SparkSQL/PySpark code which uses broadcast hints. 
It takes longer to run than on previous Databricks Runtimes and/or fails with an out of memory error message.\nExample code:\ndf.join(broadcast(bigDf)).write.mode('overwrite').parquet(\"path\")\nError message:\nJob aborted due to stage failure: There is not enough memory to build the hash map\r\nCaused by: There is not enough memory to build the hash map\nIf you check the execution plan under the SQL tab in the Apache Spark UI, it indicates the failure occurred during an executor broadcast join.\nCause\nExecutor side broadcast join (EBJ) is an enhancement introduced in Databricks Runtime 11.3 LTS. It optimizes the way a broadcast join functions. Previously, broadcast joins relied on the Spark driver to broadcast one side of the join. While this approach worked, it presented the following challenges:\nSingle point of failure:\nThe driver, as the coordinator of all queries in a cluster, faced an increased risk of failure due to JVM out of memory errors when used for broadcasting across all concurrent queries. In such a case, all queries running simultaneously on the cluster could be affected and potentially fail.\nLimited flexibility in increasing default broadcast join thresholds:\nThe risk of out of memory errors on the driver (due to concurrent broadcasts) made it difficult to increase the default broadcast join thresholds (currently 10MB static and 30MB dynamic), as it would add more memory pressure on the driver and elevate the risk of failure.\nThe EBJ enhancement addresses these challenges by shifting the memory pressure from the driver to the executors. This not only eliminates a single point of failure but also allows for increasing the default broadcast join thresholds across the board, thanks to the reduced risk of total cluster failure.\nUnfortunately, with EBJ enabled, queries that use an explicit broadcast hint may now fail. 
This is more likely with a larger data set.\nThe issue encountered is related to the way memory consumption was managed in the driver-side broadcast. The memory used by the driver-side broadcast is controlled by\nspark.driver.maxResultSize\n. In earlier versions, the memory used for the broadcast was not explicitly tracked and accounted for in the task memory manager. This meant that the task could underestimate the memory being used. As a result, large jobs could seemingly fail randomly, due to out of memory errors, while others would succeed due to luck.\nWith EBJ enabled, the memory allocated for the broadcast hash map is accurately tracked and deducted from the task memory manager's available memory. This improvement reduces the risk of memory leaks and JVM out of memory errors. As a consequence, large broadcasts may be handled differently, but reliability has improved.\nSolution\nTracking memory usage in EBJ is essential for preventing JVM out of memory errors, which can lead to the failure of all concurrent queries on a node.\nInstead of using broadcast hints, you should let adaptive query execution (\nAWS\n|\nAzure\n|\nGCP\n) select which join is suitable for the workload based on the size of the data being processed." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-fails-with-spark-shuffle-fetchfailedexception-error.json b/scraped_kb_articles/job-fails-with-spark-shuffle-fetchfailedexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..7e218b4e89e6039ece5fc42a11fd10aa430a7043 --- /dev/null +++ b/scraped_kb_articles/job-fails-with-spark-shuffle-fetchfailedexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-fails-with-spark-shuffle-fetchfailedexception-error", + "title": "Unknown Article Title", + "content": "Problem\nIf your application contains any aggregation or join stages, the execution will require a Spark Shuffle stage. 
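The advice in the hash-map article above amounts to deleting the `broadcast()` hint and letting adaptive query execution decide. A minimal sketch, assuming a SparkSession `spark` and DataFrames `dim_df`/`fact_df` (illustrative names, not from the article):

```python
def write_joined_without_hint(spark, dim_df, fact_df, path):
    """Join without a broadcast() hint and let adaptive query execution pick
    the join strategy from runtime statistics instead of forcing a broadcast."""
    # AQE is on by default in recent Databricks Runtimes; set it explicitly
    # here only to make the dependency visible.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Plain join: Spark broadcasts a side only when runtime stats say it fits.
    dim_df.join(fact_df, "id").write.mode("overwrite").parquet(path)
```

The design point is that AQE measures the actual size of each side at runtime, whereas a hint asserts a size the author may be wrong about.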
Depending on the specific configuration used, if you are running multiple streaming queries on an interactive cluster you may get a shuffle\nFetchFailedException\nerror.\nShuffleMapStage has failed the maximum allowable number of times\r\nDAGScheduler: ShuffleMapStage 499453 (start at command-39573728:13) failed in 468.820 s due to\r\norg.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 228703\r\norg.apache.spark.shuffle.FetchFailedException: Connection reset by peer\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:747)\r\nCaused by: java.io.IOException: Connection reset by peer\nCause\nShuffle fetch failures usually occur during scenarios such as cluster downscaling events, executor loss, or worker decommission. In certain cases, shuffle files from the executor are lost. When a subsequent task tries to fetch the shuffle files, it fails.\nThe shuffle service is enabled by default in Databricks. This service enables an external shuffle service that preserves the shuffle files written by executors so the executors can be safely removed.\nRun\nspark.conf.get(\"spark.shuffle.service.enabled\")\nin a Python or Scala notebook cell to return the current value of the shuffle service. If it returns\ntrue\nthe service is enabled.\nspark.conf.get(\"spark.shuffle.service.enabled\")\nSolution\nDisable the default Spark Shuffle service.\nDisabling the shuffle service does not prevent the shuffle, it just changes the way it is performed. When the service is disabled, the shuffle is performed by the executor.\nYou can disable the shuffle service by adding\nspark.shuffle.service.enabled false\nto the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nspark.shuffle.service.enabled false\nRestart the cluster after updating the\nSpark config\n.\nInfo\nThere is a slight performance impact when the shuffle service is disabled." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-failures-when-running-apache-spark-jobs-processing-mongodb-data.json b/scraped_kb_articles/job-failures-when-running-apache-spark-jobs-processing-mongodb-data.json new file mode 100644 index 0000000000000000000000000000000000000000..dfacfbfd4d05030345204eac197ad79c28da64f9 --- /dev/null +++ b/scraped_kb_articles/job-failures-when-running-apache-spark-jobs-processing-mongodb-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/job-failures-when-running-apache-spark-jobs-processing-mongodb-data", + "title": "Unknown Article Title", + "content": "Problem\nWhen running Apache Spark jobs that involve data processing with MongoDB, you receive an error message.\ncom.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='XX,XXX'})\r\nat com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)\r\nat com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)\r\nat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\r\nat scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)\r\nat scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)\r\nat scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)\r\nat scala.collection.TraversableLike.map(TraversableLike.scala:286)\r\nat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\nCause\nThere is a schema mismatch between the data types the Spark job expects and the actual data types in the MongoDB source.\nSpecifically in the previous error message, you are attempting to cast a\nSTRING\nvalue in MongoDB into a DoubleType in Spark. 
ANSI compliance settings in Spark SQL enforce strict type checking and can lead to job failures when data types do not match exactly.\nSolution\nValidate your source data and update your code to make sure data types match.\nIf validating and updating are not feasible, you can disable ANSI compliance in Spark SQL to allow more lenient type conversions. Add the following configurations within a notebook cell or to your cluster settings.\nspark.conf.set(\"spark.sql.ansi.enabled\",\"false\")\r\nspark.conf.set(\"spark.sql.storeAssignmentPolicy\",\"LEGACY\")\nImportant\nWith these configurations, Spark may return null results instead of throwing errors when invalid inputs are encountered in SQL operators or functions, reducing error visibility. Do not use these configurations if data integrity is critical.\nFor more information, refer to the\nANSI compliance in Databricks Runtime\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-id-column-not-consistently-showing-values-in-the-apache-spark-ui-for-sub-execution-ids.json b/scraped_kb_articles/job-id-column-not-consistently-showing-values-in-the-apache-spark-ui-for-sub-execution-ids.json new file mode 100644 index 0000000000000000000000000000000000000000..fab68472d7ac49bffe9d9d3a1f04d2d774777780 --- /dev/null +++ b/scraped_kb_articles/job-id-column-not-consistently-showing-values-in-the-apache-spark-ui-for-sub-execution-ids.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/job-id-column-not-consistently-showing-values-in-the-apache-spark-ui-for-sub-execution-ids", + "title": "Unknown Article Title", + "content": "Problem\nWhen you’re reviewing your Apache Spark UI to optimize query performance, you notice a\nSub Execution ID\ncolumn in addition to the query\nID\ncolumn in the\nSQL/DataFrame\ntab. 
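For the MongoDB article above, note that the failing value (`'XX,XXX'`) looks like a number with a thousands separator, so the recommended "validate and update" path often reduces to stripping the separator before casting. The helper below shows that rule in plain Python; the commented PySpark equivalent uses an illustrative column name, not one from the article.

```python
def to_double(raw):
    """Parse a numeric string that may carry ',' thousands separators.
    Returns None when the value still does not parse, mirroring the null
    that LEGACY mode would silently produce."""
    try:
        return float(raw.replace(",", ""))
    except (AttributeError, ValueError):
        return None

# PySpark equivalent (sketch; "amount" is a hypothetical column name):
# from pyspark.sql.functions import col, regexp_replace
# df = df.withColumn("amount",
#                    regexp_replace(col("amount"), ",", "").cast("double"))
```

Cleaning the value keeps ANSI checking enabled, so genuinely malformed records still fail loudly instead of becoming nulls.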
You also notice that the\nJob ID\ncolumn doesn’t consistently show values in the UI, as shown in the following image.\nCause\nA Job ID is generated for a Sub Execution ID only when an action, such as\ncollect()\n,\ncount()\n, or\nsaveAsTextFile()\nis triggered. If no such action is required (for example in certain transformations such as\nmap()\n,\nfilter()\n, or simple metadata fetching), Spark will not create a new Job ID, leaving Sub Execution IDs with no corresponding job.\nContext\nSub Execution IDs in the Spark UI are identifiers for individual sub-parts of the queries executed in Spark. A query is divided into multiple Sub Execution IDs to enhance its execution speed. You can review the relationship between Job ID and Sub Execution Job IDs in the\nSQL/DataFrame\ntab.\nSolution\nTo view all the Sub Execution IDs together in your Spark UI’s SQL/DataFrame tab:\nClick the Sub Execution IDs column, which opens in a new window.\nClick the Sub Execution IDs column again to sort the column and see all the IDs together." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-idle-before-start.json b/scraped_kb_articles/job-idle-before-start.json new file mode 100644 index 0000000000000000000000000000000000000000..888e28769f10128bbdacc48f0f07b00a7d14f980 --- /dev/null +++ b/scraped_kb_articles/job-idle-before-start.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/job-idle-before-start", + "title": "Unknown Article Title", + "content": "Problem\nYou have an Apache Spark job that is triggered correctly, but remains idle for a long time before starting.\nYou have a Spark job that ran well for a while, but goes idle for a long time before resuming.\nSymptoms include:\nCluster downscales to the minimum number of worker nodes during idle time.\nDriver logs don’t show any Spark jobs during idle time, but do contain repeated information about metadata.\nGanglia shows activity only on the driver node.\nExecutor logs show no activity.\nAfter some time passes, the cluster scales up and Spark jobs start or resume.\nCause\nThese symptoms indicate that there are a lot of file scan operations happening during this period of the job. Tables are read and consumed in downstream operations.\nYou see the file scan operation details when you review the SQL tab in the Spark UI. The queries appear to be completed, which makes it appear as though no work is being performed during this idle time.\nThe driver node is busy because it is performing the file listing and processing data (metadata containing schema and other information). This work only happens on the driver node, which is why you only see driver node activity in the Ganglia metrics during this time.\nThis issue becomes more pronounced if you have a large number of small files.\nSolution\nYou should control the file size and number of files ingested at the source location by implementing a preprocessing step. 
You can also break down the ingestion into a number of smaller steps, so a smaller number of files have to be scanned at once.\nAnother option is to migrate your data store to Delta Lake, which uses transactional logs as an index for all the underlying files." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-rate-limit.json b/scraped_kb_articles/job-rate-limit.json new file mode 100644 index 0000000000000000000000000000000000000000..04c26fbcb17cdf59380a98969f58531fca2b2ffa --- /dev/null +++ b/scraped_kb_articles/job-rate-limit.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-rate-limit", + "title": "Unknown Article Title", + "content": "Problem\nA Databricks notebook or Jobs API request returns the following error:\nError : {\"error_code\":\"INVALID_STATE\",\"message\":\"There were already 1000 jobs created in past 3600 seconds, exceeding rate limit: 1000 job creations per 3600 seconds.\"}\nYou are also unable to run jobs, and get a QUOTA_EXCEEDED error message:\n'error_code':'QUOTA_EXCEEDED','message':'The quota for the number of jobs has been reached. The current quota is 1000. This quota is only applied to jobs created through the UI or through the /jobs/create endpoint, which are displayed in the Jobs UI\nCause\nThis error occurs because the number of jobs per hour exceeds the limit of 1000 established by Databricks to prevent API abuse and ensure quality of service.\nSolution\nIf you cannot ensure that the number of jobs created in your workspace is less than 1000 per hour, contact Databricks Support to request a higher limit. A job rate limit increase requires at least 20 minutes of downtime.\nDatabricks can increase the job limit maximumJobCreationRate up to 2000.\nCurrently running jobs will be affected while the limit is being increased." 
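The preprocessing step recommended in the idle-job article above usually means compacting many small files into fewer, larger ones before ingestion. A sketch with an illustrative sizing helper; the paths, the Spark-side function, and the 128 MB target are assumptions, not from the article.

```python
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of output files needed so each is roughly target_file_bytes
    (128 MB default, a common Parquet target)."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

def compact(spark, src_path, dst_path, total_bytes):
    """One-off compaction pass (requires a SparkSession); rewrites many small
    files as a few large ones so downstream jobs list and scan fewer files."""
    (spark.read.parquet(src_path)
         .repartition(target_partitions(total_bytes))
         .write.mode("overwrite")
         .parquet(dst_path))
```

On Delta Lake the same goal is met with the built-in `OPTIMIZE` command instead of a manual rewrite.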
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-run-fails-with-error-message-could-not-reach-driver-of-cluster.json b/scraped_kb_articles/job-run-fails-with-error-message-could-not-reach-driver-of-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..3d535a36c3579249bdbd407fba1c849f50b081c2 --- /dev/null +++ b/scraped_kb_articles/job-run-fails-with-error-message-could-not-reach-driver-of-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/job-run-fails-with-error-message-could-not-reach-driver-of-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to run a job, you notice it fails intermittently with the following error message.\nError: Run failed with error message: Could not reach driver of cluster .\nCause\nA high load on the driver node (approaching or at 100%) causes CPU thrashing. This thrashing prevents Python REPL threads (ipykernel) from starting within the expected timeout of 80 seconds, instead becoming sluggish or unresponsive during startup. When the REPL fails to start within that timeout, the run fails.\nTroubleshoot your case\nThe error\n\"Could not reach driver of cluster \"\ncan occur due to several different reasons. Use the following troubleshooting steps to verify the cause of your error matches the cause in this KB article.\nCheck whether the job runs multiple tasks concurrently, which can increase the load on the driver.\nDuring the time of failure, check if the driver’s CPU and memory utilization are unusually high (approaching or at 100%).\nLook for the following error trace in the driver logs. 
This error indicates a REPL (Read-Eval-Print Loop) startup failure due to timeout, often caused by too many REPLs being created simultaneously.\nFailed to start repl ReplId-\r\ncom.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: \r\nUnable to start python kernel for ReplId-, kernel did not start within 80 seconds.\nSolution\nIncrease the REPL launch timeout by setting the following Apache Spark configuration. This gives the REPL process more time to initialize, which helps prevent failures under high load.\nspark.databricks.driver.ipykernel.launchTimeoutSeconds 300\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nPreventative measures\nA persistently overloaded driver is not a sustainable setup. For long-term stability and performance:\nLower the number of simultaneous job submissions to prevent overwhelming the driver.\nConsider switching to a driver with more CPU cores to better handle parallel REPL creation and workload." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-timeout-when-connecting-to-a-sql-endpoint-over-jdbc.json b/scraped_kb_articles/job-timeout-when-connecting-to-a-sql-endpoint-over-jdbc.json new file mode 100644 index 0000000000000000000000000000000000000000..785f608a472c7d42b9886d8fbecfebd7144829c4 --- /dev/null +++ b/scraped_kb_articles/job-timeout-when-connecting-to-a-sql-endpoint-over-jdbc.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/job-timeout-when-connecting-to-a-sql-endpoint-over-jdbc", + "title": "Unknown Article Title", + "content": "Problem\nYou have a job that is reading and writing to an SQL endpoint over a JDBC connection.\nThe SQL warehouse fails to execute the job and you get a\njava.net.SocketTimeoutException: Read timed out\nerror message.\n2022/02/04 17:36:15 - TI_stg_trade.0 - Caused by: com.simba.spark.jdbc42.internal.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out\r\n2022/02/04 17:36:15 - TI_stg_trade.0 - at com.simba.spark.hivecommon.api.TETHttpClient.flushUsingHttpClient(Unknown Source)\r\n2022/02/04 17:36:15 - TI_stg_trade.0 - at com.simba.spark.hivecommon.api.TETHttpClient.flush(Unknown Source)\r\n2022/02/04 17:36:15 - TI_stg_trade.0 - at com.simba.spark.jdbc42.internal.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)\r\n2022/02/04 17:36:15 - TI_stg_trade.0 - at com.simba.spark.jdbc42.internal.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)\nCause\nEach incoming request requires a thread for the duration of the request. When the number of simultaneous requests is greater than the number of available threads, a timeout can occur. 
This can occur during long running queries.\nSolution\nIncrease the\nSocketTimeout\nvalue in the JDBC connection URL.\nIn this example, the\nSocketTimeout\nis set to 300 seconds:\njdbc:spark://:443;HttpPath=;TransportMode=http;SSL=1[;property=value[;property=value]];SocketTimeout=300\nFor more information, review the\nBuilding the connection URL for the legacy Spark driver\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/job-to-insert-parquet-file-to-table-fails-with-error-failed_read_fileparquet_column_data_type_mismatch.json b/scraped_kb_articles/job-to-insert-parquet-file-to-table-fails-with-error-failed_read_fileparquet_column_data_type_mismatch.json new file mode 100644 index 0000000000000000000000000000000000000000..0af7af4ff4d9d8e3b71787818410a95a757e14cd --- /dev/null +++ b/scraped_kb_articles/job-to-insert-parquet-file-to-table-fails-with-error-failed_read_fileparquet_column_data_type_mismatch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-to-insert-parquet-file-to-table-fails-with-error-failed_read_fileparquet_column_data_type_mismatch", + "title": "Unknown Article Title", + "content": "Problem\nYou have a table with a given number of columns of a given data type, to which you are writing a Parquet file. When you run a job to insert the additional Parquet file, the job fails with an error.\n[FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH] Error while reading file dbfs:/.snappy.parquet. Data type mismatches when reading Parquet column . Expected Spark type , actual Parquet type . SQLSTATE: KD001\nCause\nThe Parquet file's schema contains data types (for example, string) that are different from the expected Apache Spark data types (for example, integer) in the original table.\nThe table schema is validated during file read (not write), and if there is a data type mismatch at that point the error occurs.\nSolution\n1. 
Examine the error message and identify the Parquet file that is causing the issue.\n2. Examine the schema of the Parquet file using Spark's built-in Parquet reader. Compare the Parquet file’s schema with the original table’s schema.\n3. Fix the Parquet file’s schema by re-writing the data to a separate DataFrame with the correct schema. Use Spark's built-in functions such as\n`withColumn`\nor\n`cast`\nto convert the data types to the expected Spark data types.\n4. Insert the records from the fixed DataFrame back into the original table. Use the following code.\n%python\r\n\r\nfrom pyspark.sql.functions import col\r\nchange_schema = spark.read.parquet()\r\nchange_schema = change_schema.withColumn(\"id\", col(\"id\").cast(\"bigint\"))\r\nchange_schema.write.mode(\"append\").format(\"parquet\").save()\nPreventative measures\n1. Validate the schema of the data before ingesting it into Databricks.\n2. Use data type conversions to ensure that the data types in the Parquet files match the expected Spark data types." 
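Steps 1 and 2 of the parquet-mismatch solution above can be partially automated. The helper below compares two schemas and reports type differences; it accepts Spark StructType objects (via their `.fields`) or plain `(name, type)` pairs, and the commented call site shows an assumed way to obtain the schemas, not code from the article.

```python
def schema_mismatches(file_schema, table_schema):
    """Return {column: (parquet_type, expected_type)} for columns whose types
    differ between a Parquet file's schema and the target table's schema."""
    def as_dict(schema):
        # Spark StructType exposes .fields; otherwise accept (name, type) pairs.
        if hasattr(schema, "fields"):
            return {f.name: f.dataType for f in schema.fields}
        return dict(schema)

    actual, expected = as_dict(file_schema), as_dict(table_schema)
    return {name: (actual[name], expected[name])
            for name in actual
            if name in expected and actual[name] != expected[name]}

# Sketch of use on a cluster:
# diff = schema_mismatches(spark.read.parquet(path).schema,
#                          spark.table("my_table").schema)
```

Any column reported here is a candidate for the `withColumn`/`cast` rewrite shown in step 3.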
+} \ No newline at end of file diff --git a/scraped_kb_articles/job-with-string-data-failing-with-cast_overflow_in_table_insert-overflow-error.json b/scraped_kb_articles/job-with-string-data-failing-with-cast_overflow_in_table_insert-overflow-error.json new file mode 100644 index 0000000000000000000000000000000000000000..b6e5ac2d22b691b28c918d1e1f3dbf20875b020a --- /dev/null +++ b/scraped_kb_articles/job-with-string-data-failing-with-cast_overflow_in_table_insert-overflow-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/job-with-string-data-failing-with-cast_overflow_in_table_insert-overflow-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to run a job with string data and the target schema is decimal data type, it fails with the following error message.\n[CAST_OVERFLOW_IN_TABLE_INSERT] Fail to assign a value of \"STRING\" type to the \"DECIMAL(p,s)\" type column or variable `dataColumn` due to an overflow.\nCause\nThe string data is longer than the target column’s size.\nApache Spark performs implicit crosscasting (because ANSI mode is disabled by default for non-serverless computes) based on the target data type. When the string value exceeds the limit of the decimal data type, an overflow exception is thrown.\nSolution\nImportant\nYou may want to explicitly cast\nstring\nto\ndecimal(p,s)\nbefore inserting, but this action will insert null values and still result in overflow in the defined\ndecimal(p,s)\ndata type.\nInstead, there are two options available.\nThe first option is to increase the limit of\ndecimal(p,s)\n. The maximum for the decimal data type is\n38\n. The following image shows an example notebook creating a table with\ndecimal(19,9)\n. Then a decimal formatted as a string is inserted. It is successfully inserted and the output is modified to fit the target schema.\nThe second option is to change the source data type to an equivalent type to the target, instead of string. 
This will ensure that the data types match and prevent any overflow exceptions.\nFor more information on the decimal data type, refer to the\nDECIMAL TYPE\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-deployed-using-databricks-asset-bundles-show-running-in-the-terminal-but-then-seem-to-hang-indefinitely.json b/scraped_kb_articles/jobs-deployed-using-databricks-asset-bundles-show-running-in-the-terminal-but-then-seem-to-hang-indefinitely.json new file mode 100644 index 0000000000000000000000000000000000000000..f13d667f8474c14ef70a3cffb1f09fa1e119cfd7 --- /dev/null +++ b/scraped_kb_articles/jobs-deployed-using-databricks-asset-bundles-show-running-in-the-terminal-but-then-seem-to-hang-indefinitely.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/jobs-deployed-using-databricks-asset-bundles-show-running-in-the-terminal-but-then-seem-to-hang-indefinitely", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re deploying multiple jobs using Databricks Asset Bundles (DABs), particularly notebooks that utilize Auto Loader. During the deployment process, you execute the following command.\ndatabricks bundle run -t  \nThe terminal returns the following status message.\n2025-01-13 15:58:10 \"\" RUNNING\nFollowing this message, the terminal appears to hang indefinitely.\nThe terminal seeming to hang indefinitely leads to uncertainty about whether the job has successfully started or the command is waiting for the job to complete.\nCause\nBy default, the\ndatabricks bundle run\ncommand performs two primary actions sequentially:\nInitiates the deployment of the job within the specified environment.\nWaits for the job to finish executing before returning control to the terminal.\nFor standard jobs that have a defined completion point, this behavior is typically acceptable. 
However, in scenarios with continuous jobs, which are designed to run perpetually and may have unlimited retries, there is no natural conclusion. The lack of natural conclusion leads the\ndatabricks bundle run\ncommand to wait indefinitely, causing the terminal to appear unresponsive.\nSolution\nUse the\n--no-wait\nflag with the\ndatabricks bundle run\ncommand. This flag alters the command's behavior to start the job and immediately return control to the terminal without waiting for the job to finish.\ndatabricks bundle run -t  my_job --no-wait\nBenefits of using --no-wait\nThe command initiates the job and returns control, providing immediate confirmation that the job has started.\nThe waiting period associated with job completion is eliminated, streamlining the deployment process.\nThe deployment of multiple continuous jobs is facilitated without the risk of terminal hang-ups or prolonged wait times.\nLess ambiguity about the job's status post-deployment since the command does not linger waiting for a process that is intended to run indefinitely." 
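The non-blocking invocation can be wrapped in a small script when you deploy many continuous jobs in a row. This is a hedged sketch: the helper function is ours (not part of the Databricks CLI), and `"dev"` and `"my_job"` are placeholder target and job names.

```python
# Build the argv list for a non-blocking `databricks bundle run`
# invocation. Only the CLI flags shown in the article are used.

def bundle_run_args(target, job_key, wait=False):
    args = ["databricks", "bundle", "run", "-t", target, job_key]
    if not wait:
        # return control as soon as the run is started
        args.append("--no-wait")
    return args

print(bundle_run_args("dev", "my_job"))
```

To actually launch each job, pass the list to something like `subprocess.run(bundle_run_args("dev", "my_job"), check=True)`; the call returns as soon as the CLI confirms the run has started.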
+} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-fail-with-error-there-are-already-1000-active-runs-limit-1000.json b/scraped_kb_articles/jobs-fail-with-error-there-are-already-1000-active-runs-limit-1000.json new file mode 100644 index 0000000000000000000000000000000000000000..d48bdb4a56bdd0e0a6f71e6acea6b52649862540 --- /dev/null +++ b/scraped_kb_articles/jobs-fail-with-error-there-are-already-1000-active-runs-limit-1000.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/jobs-fail-with-error-there-are-already-1000-active-runs-limit-1000", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nJobs may fail with an\nerror indicating that the workspace has reached the maximum number of active runs, which is set to 1000.\nRun failed with error message\r\nThere are already 1000 active runs (limit: 1000).\nCause\nJobs are either creating new runs in a loop, causing the workspace to reach the 1000 limit of active runs, or multiple job runs are scheduled at the same time.\nSolution\nFirst, identify the jobs that are causing the issue.\nNavigate to\nJob Runs\nin the UI.\nCheck how many jobs are running.\nVerify if any jobs have a high number of runs corresponding to the same job ID (evidence of creating new runs in a loop).\nNext, cancel the runs for the identified jobs using the UI or the API.\nThrough the UI:\nNavigate to\nWorkflows\n.\nSelect the job that has been identified as running in a loop.\nFrom the\nRuns\ntab, select the option\nCancel Runs\n.\nTo use the API, please refer to the\nCancel all runs of a job\ndocumentation.\nThen, check the job configuration and scheduling settings. Be sure you define schedules based on your expected timeframe, avoiding running jobs within seconds or minutes of each other. Avoid steps that call the job multiple times.\nLast, monitor the workspace to ensure that the issue does not recur." 
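The API route mentioned above can be scripted. This sketch assumes the Jobs API 2.1 "Cancel all runs of a job" endpoint (`POST /api/2.1/jobs/runs/cancel-all`); the workspace URL and job ID are placeholders, and sending the request with an HTTP client plus a bearer token is left to you.

```python
# Build the request for cancelling all active runs of one looping job,
# per the "Cancel all runs of a job" Jobs API documentation.

def cancel_all_runs_request(workspace_url, job_id):
    """Return the endpoint URL and JSON body for the cancel-all call."""
    url = f"{workspace_url}/api/2.1/jobs/runs/cancel-all"
    return url, {"job_id": job_id}

url, body = cancel_all_runs_request("https://example.cloud.databricks.com", 123)
print(url)
print(body)
```

Repeat the call for each job ID you identified in the Job Runs UI as creating runs in a loop.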
+} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-failing-at-data-shuffle-stage-with-error-orgapachesparkshufflefetchfailedexception.json b/scraped_kb_articles/jobs-failing-at-data-shuffle-stage-with-error-orgapachesparkshufflefetchfailedexception.json new file mode 100644 index 0000000000000000000000000000000000000000..6c721255f9b7a7f89e4ef66cdab90254a600d578 --- /dev/null +++ b/scraped_kb_articles/jobs-failing-at-data-shuffle-stage-with-error-orgapachesparkshufflefetchfailedexception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/jobs-failing-at-data-shuffle-stage-with-error-orgapachesparkshufflefetchfailedexception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an application with joins or aggregations, so your data needs to be reorganized according to the key. This requires a data shuffle stage. Your job fails at this data shuffle stage with the following exception.\norg.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 149 (localCheckpoint at MergeIntoMaterializeSource.scala:364) has failed the maximum allowable number of times: 4. Most recent failure reason:\r\norg.apache.spark.shuffle.FetchFailedException\r\n\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1168)\r\nat org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$next$1(ShuffleBlockFetcherIterator.scala:904)\r\nat com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)\r\nat \r\nCaused by: java.nio.channels.ClosedChannelException\nCause\nThis issue occurs when a task run is not able to fetch shuffle data that is supposed to serve as input to the stage. 
The inability to fetch shuffle data can occur when:\nA cluster downsizes during an auto-scaling event before the shuffle data is read.\nA spot instance is terminated.\nAn executor becomes unresponsive or runs out of memory.\nCluster downsizing and terminated spot instances cause ongoing shuffle operations in an Apache Spark job to fail, due to missing data from the terminated executor.\nWhen an executor hosting the shuffle data or shuffle server gets overwhelmed and stops responding to shuffle data requests, the nonresponsiveness causes the failure.\nSolution\nFirst, diagnose which cause is behind your job’s failure. The process is detailed in the\nFailing jobs or executors removed\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFrom there, long-term solutions can focus on optimizing code logic, data layout, and minimizing data explosions depending on your context. The following two options specifically consider large joins or tasks handling excessive data.\nIf large joins cause fetch failures, broadcast smaller tables to avoid shuffles.\nspark.sql.autoBroadcastJoinThreshold   \r\nspark.databricks.adaptive.autoBroadcastJoinThreshold \nIf tasks handle excessive data, leading to memory issues or disk spills, increase shuffle partitions or use auto. Note that changing\nspark.sql.shuffle.partitions\nonly applies when a new query is started. Resuming will always pick up the previous value.\nspark.sql.shuffle.partitions appropriate_value or auto\nFor temporary, short term solutions to help your job succeed, consider the following configurations.\nUse the Spark UI to check if a failed shuffle stage made significant progress after X number of attempts. If so, increasing retry attempts may help job progress past the stage.\nspark.shuffle.io.maxRetries (default: 3)\nConfigure task-level parallelism carefully to avoid underutilized CPU resources. Note that\nspark.task.cpus\nincreases the number of cores per task and reduces the overall load on the worker. 
This reduces parallelism, so increase the number of workers accordingly.\nspark.task.cpus where value  > 1 but < total_cores\nImportant\nIncreasing the\nspark.task.cpus\nvalue without increasing worker count increases job duration.\nPrevent worker-side network-related timeouts during large garbage collection pauses by increasing timeout values.\nspark.network.timeout = 800s  \r\nspark.rpc.timeout = 600s\nAs a final temporary measure, you can increase cluster capacity. Note that this may incur extra costs." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-failing-with-bindexception-error-after-upgrading-to-databricks-runtime-11-3-lts-or-above.json b/scraped_kb_articles/jobs-failing-with-bindexception-error-after-upgrading-to-databricks-runtime-11-3-lts-or-above.json new file mode 100644 index 0000000000000000000000000000000000000000..749a1d81ab502dd212350cdde4019d5e1b46b604 --- /dev/null +++ b/scraped_kb_articles/jobs-failing-with-bindexception-error-after-upgrading-to-databricks-runtime-11-3-lts-or-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/jobs-failing-with-bindexception-error-after-upgrading-to-databricks-runtime-11-3-lts-or-above", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou notice your jobs fail after upgrading Databricks Runtime to 11.3 LTS or above with a\nBindException\nerror message.\nCaused by: java.net.BindException: Address already in use.\nCause\nIpywidgets, introduced as of Databricks Runtime 11.2, by default occupies port 6062. If you have a third-party integration tool installed on your cluster, such as Datadog, when ipywidgets attempts to use port 6062, it will find the port already occupied by another service and throw the error.\nSolution\nChange the default port used by ipywidgets to another available port.\nVerify that other services in your environment are not already using your chosen port. 
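One quick way to verify that a candidate port is free on the driver is to try binding to it. This is a generic sketch, not a Databricks utility; port 6063 is just an example candidate to replace 6062.

```python
# Check whether a TCP port is free by attempting to bind to it.
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True   # bind succeeded, nothing else holds the port
        except OSError:
            return False  # "Address already in use" (or similar)

print(port_is_free(6063))  # example candidate to replace 6062
```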
Databricks also recommends avoiding the ports its services use: 443, 3306, 6666, 2443, and 8443.\nSet the Apache Spark configuration\nspark.databricks.driver.ipykernel.commChannelPort\nto a different port number when you create your cluster. Use the following code snippet.\nspark.databricks.driver.ipykernel.commChannelPort \nFor more information on configuring ipywidgets, refer to the\nipywidgets\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-failing-with-schema-conversion-error-cannot-convert-parquet-type-int32-to-photon-type-long.json b/scraped_kb_articles/jobs-failing-with-schema-conversion-error-cannot-convert-parquet-type-int32-to-photon-type-long.json new file mode 100644 index 0000000000000000000000000000000000000000..69cf3405356daa4147e4643d9811bce903095361 --- /dev/null +++ b/scraped_kb_articles/jobs-failing-with-schema-conversion-error-cannot-convert-parquet-type-int32-to-photon-type-long.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/jobs-failing-with-schema-conversion-error-cannot-convert-parquet-type-int32-to-photon-type-long", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Parquet files in Delta Lake on a Photon-enabled cluster, you notice your jobs fail with the following error.\nSchema conversion error: cannot convert Parquet type INT32 to Photon type long\nCause\nThe Photon engine uses a different set of data types than the traditional Apache Spark execution engine. The INT32 type used in the Parquet files is not directly convertible to Photon's native long data type. This mismatch causes a schema conversion error.\nSolution\nDisable Photon's reader by setting the configuration property\nspark.databricks.photon.scan.enabled\nto\nfalse\n.\nThis configuration change bypasses the Photon engine's reader, which is the component responsible for the schema conversion error. 
The Spark engine reverts to its traditional reader, which is able to handle the INT32 data type without issues.\nAdditionally, ensure you’re using a Databricks Runtime that supports mixed types in Parquet files, such as Databricks Runtime 14.3 LTS or above.\nImportant\nDisabling Photon on a cluster may decrease the efficiency of some queries, resulting in slower executions. However, disabling allows the job to run successfully in this case." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-fails-with-a-timeoutexception-error.json b/scraped_kb_articles/jobs-fails-with-a-timeoutexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..3e26bf250861dc199985fdf5978e003afe901756 --- /dev/null +++ b/scraped_kb_articles/jobs-fails-with-a-timeoutexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/jobs-fails-with-a-timeoutexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are running Apache Spark SQL queries that perform join operations on DataFrames, but the queries keep failing with a\nTimeoutException\nerror message.\nExample stack trace\nCaused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]\nat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)\nat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\nat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\nat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\nCause\nThis problem usually stems from Spark trying to perform a Broadcast join when the fetching of blocks from different executors consumes an excessive amount of time.\nSpark performs Broadcast join using the BitTorrent protocol. The driver splits the data to be broadcasted into small chunks and stores the chunks in the block manager of the driver. The driver also sends the chunks of data to the executors. 
Each executor keeps copies of the chunks of data in its own block manager.\nWhen a specific executor is not able to fetch the chunks of data from its local block manager (say the executor died and was re-launched) that executor tries to fetch the broadcast data from the driver as well as other executors. This avoids the driver being the bottleneck in serving the remote requests.\nEven with this distributed approach, there are some scenarios where the broadcast can take an excessive amount of time, resulting in a\nTimeoutException\nerror.\nBusy driver or busy executor\n: If the Spark driver and executors are extremely busy, it can introduce delay in the broadcast process. If the broadcast process exceeds the threshold limits, it can result in a broadcast timeout.\nLarge broadcast data size\n: Trying to broadcast a large amount of data can also result in a broadcast timeout. Spark has a default limit of 8GB for broadcast data.\nSolution\nYou need to identify the query that is causing a resource bottleneck on the cluster. Open the\nSpark UI\n(\nAWS\n|\nAzure\n|\nGCP\n) and review any failed stages to locate the SQL query causing the failure. Review the Spark SQL plan to see if it uses\nBroadcastNestedLoopJoin\n.\nIf the Spark SQL plan uses\nBroadcastNestedLoopJoin\n, you need to follow the instructions in the\nDisable broadcast when query plan has BroadcastNestedLoopJoin\narticle.\nIf the Spark SQL plan does not use\nBroadcastNestedLoopJoin\n, you can disable the Broadcast join by setting Spark config values right before the problematic query. You can then revert these changes after the problematic query. 
Making the change query specific allows other queries, which can benefit from the Broadcast join, to still leverage the benefits.\nSET spark.sql.autoBroadcastJoinThreshold=-1\nThis disables Broadcast join.\nSET spark.databricks.adaptive.autoBroadcastJoinThreshold=-1\nThis particular configuration disables adaptive Broadcast join.\nAnother option is to increase\nspark.sql.broadcastTimeout\nto a value above 300 seconds, which is the default value. Increasing\nspark.sql.broadcastTimeout\nallows more time for the broadcasting process to finish before it generates a failure. The downside to this approach, is that it may result in longer query times.\nFor example, setting the value to 600 doubles the amount of time for the Broadcast join to complete.\nSET spark.sql.broadcastTimeout=600\nThis value can be set at the cluster level or the notebook level." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-idempotency.json b/scraped_kb_articles/jobs-idempotency.json new file mode 100644 index 0000000000000000000000000000000000000000..f7b45350812c52ee5d502ac3d7c780f6a1a728c2 --- /dev/null +++ b/scraped_kb_articles/jobs-idempotency.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/jobs-idempotency", + "title": "Título do Artigo Desconhecido", + "content": "When you submit jobs through the Databricks Jobs REST API, idempotency is not guaranteed. If the client request is timed out and the client resubmits the same request, you may end up with duplicate jobs running.\nTo ensure job idempotency when you submit jobs through the Jobs API, you can use an idempotency token to define a unique value for a specific job run. 
If the same job has to be retried because the client did not receive a response due to a network error, the client can retry the job using the same idempotency token, ensuring that a duplicate job run is not triggered.\nHere is an example of a REST API JSON payload for the Runs Submit API using an\nidempotency_token\nwith a value of 123:\n{\r\n  \"run_name\":\"my spark task\",\r\n  \"new_cluster\": {\r\n    \"spark_version\":\"5.5.x-scala2.11\",\r\n    \"node_type_id\":\"r5.xlarge\",\r\n    \"aws_attributes\": {\r\n      \"availability\":\"ON_DEMAND\"\r\n    },\r\n    \"num_workers\":10\r\n  },\r\n  \"libraries\": [\r\n    {\r\n      \"jar\":\"dbfs:/my-jar.jar\"\r\n    },\r\n    {\r\n      \"maven\": {\r\n        \"coordinates\":\"org.jsoup:jsoup:1.7.2\"\r\n      }\r\n    }\r\n  ],\r\n  \"spark_jar_task\": {\r\n    \"main_class_name\":\"com.databricks.ComputeModels\"\r\n  },\r\n  \"idempotency_token\":\"123\"\r\n}\nAll requests with the same idempotency token should return 200 with the same run ID." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-in-sql-warehouse-returning-403-error.json b/scraped_kb_articles/jobs-in-sql-warehouse-returning-403-error.json new file mode 100644 index 0000000000000000000000000000000000000000..aa55547692e634cb3646292775645311e1927da7 --- /dev/null +++ b/scraped_kb_articles/jobs-in-sql-warehouse-returning-403-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/jobs-in-sql-warehouse-returning-403-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen running jobs in SQL Warehouse, you receive 403 errors.\nCause\nThe service account used to execute the jobs does not have sufficient permissions. 
The service account should have at least\nCAN USE\nor\nCAN MANAGE\n.\nSolution\nFirst, modify the service account permissions.\nLog in to your Databricks workspace as an administrator.\nNavigate to\nSQL Warehouse\nsettings and click\nPermissions\n.\nLocate the service account used to execute the jobs.\nClick the three dots next to the service account and select\nEdit Permissions\n.\nFrom the dropdown menu, select either\n\"Can Use\"\nor\n\"Can Manage\"\n.\nClick\nSave\nto apply the changes.\nNext, regenerate the token for the service account.\nLog in to the Databricks workspace using the service account.\nGo to\nUser Settings > Access Tokens\n.\nClick\nGenerate New Token\nand follow the prompts to create a new token.\nUpdate the job configuration with the new token.\nThen, verify permissions are set for the service account.\nLog in to the Databricks workspace as an administrator.\nNavigate to\nSQL Warehouse\nsettings and click\nPermissions\n.\nConfirm the service account now has\n\"Can Use\"\nor\n\"Can Manage\"\npermissions.\nLast, re-run the job with the updated token and ensure the 403 error doesn’t occur again." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-running-longer-than-expected-with-metastore_down-events-in-event-log.json b/scraped_kb_articles/jobs-running-longer-than-expected-with-metastore_down-events-in-event-log.json new file mode 100644 index 0000000000000000000000000000000000000000..b1bfdd82b44fb5caffc48e6e2ea1e7dc5331f620 --- /dev/null +++ b/scraped_kb_articles/jobs-running-longer-than-expected-with-metastore_down-events-in-event-log.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/jobs-running-longer-than-expected-with-metastore_down-events-in-event-log", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have jobs in Databricks that run for longer than expected. When you check the event log, you see a\nMetastore_Down\nevent. 
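The UI steps above can also be scripted when many service accounts need the same grant. This sketch assumes the Databricks Permissions API route for SQL warehouses (`PATCH /api/2.0/permissions/warehouses/{id}`); the warehouse ID and service principal name are placeholders, and issuing the PATCH with an authenticated HTTP client is left to you.

```python
# Build the request for granting a SQL warehouse permission to a
# service principal via the Permissions API.

def grant_warehouse_permission(warehouse_id, principal, level="CAN_USE"):
    """Return the endpoint path and JSON body for the grant."""
    if level not in ("CAN_USE", "CAN_MANAGE"):
        raise ValueError("jobs need at least CAN_USE on the warehouse")
    path = f"/api/2.0/permissions/warehouses/{warehouse_id}"
    body = {"access_control_list": [
        {"service_principal_name": principal, "permission_level": level},
    ]}
    return path, body

print(grant_warehouse_permission("1234abcd", "my-job-sp"))
```

After applying the grant, regenerate the token and re-run the job as described above.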
This happens when you use Hive or an external metastore like AWS Glue.\nWhen you analyze the thread dump, you find threads stuck at\ndelta-catalog-update\n.\nSample thread\ndelta-catalog-update-8\" #518 daemon prio=5 os_prio=0 tid=xxx nid=xxx waiting on condition [xxx]\r\n  java.lang.Thread.State: WAITING (parking)\r\nat sun.misc.Unsafe.park(Native Method)\r\n- parking to wait for (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)\r\nat java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)\r\nat java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)\r\nat org.spark_project.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:590)\r\nat org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:432)\r\nat org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:349)\r\nat org.apache.spark.sql.hive.client.LocalHiveClientsPool.super$borrowObject(LocalHiveClientImpl.scala:124)\r\nat org.apache.spark.sql.hive.client.LocalHiveClientsPool.$anonfun$borrowObject$1(LocalHiveClientImpl.scala:124)\r\nat org.apache.spark.sql.hive.client.LocalHiveClientsPool$$Lambda$5460/xxx.apply(Unknown Source)\r\nat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)\r\nat com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)\r\nat org.apache.spark.sql.hive.client.LocalHiveClientsPool.borrowObject(LocalHiveClientImpl.scala:122)\r\nat org.apache.spark.sql.hive.client.PoolingHiveClient.retain(PoolingHiveClient.scala:181)\r\nat org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:110)\r\nat org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$1(HiveExternalCatalog.scala:150)\r\nat org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$5186/xxx.apply(Unknown 
Source)\r\nat com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)\r\nat com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)\r\nat org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:149)\r\nat org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:1027)\r\nat org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:154)\r\nat org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.tableExists(SessionCatalog.scala:936)\r\nat com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.tableExists(ManagedCatalogSessionCatalog.scala:763)\r\nat com.databricks.sql.transaction.tahoe.hooks.UpdateCatalog.tableStillExists$1(UpdateCatalog.scala:112)\nCause\nThis happens when catalog update operations saturate the Hive client thread pool. The delta update threads can exhaust all Hive client connections, which prevents other query operations, and results in hanging jobs. This usually occurs if there is an update to the table metadata in the catalog through the\nALTER TABLE\ncommand.\nSolution\nThere are three options to try depending on your case.\nRun the\nVACUUM\ncommand\nCheck if there are a large number of files for the table.\nPeriodically run a vacuum on Delta tables to remove stale and unreferenced files, which can help in reducing the load on the metastore.\nAdjust catalog update thread pool size\nIn Databricks Runtime 14.3 LTS and above, you can control the size of the thread pool used to update the catalog. To set this configuration, adjust\nspark.databricks.delta.catalog.update.threadPoolSize\nto a value less than the default of\n20\n.\nspark.databricks.delta.catalog.update.threadPoolSize \nDisable Delta catalog update\nIf you’re using a read-only metastore database, Databricks recommends setting the following configuration on your clusters. 
This configuration controls the syncing of the most recent schema and table properties of a Delta table with the Hive metastore (or any external catalog) to ensure both of them stay the same.\nspark.databricks.delta.catalog.update.enabled false\nImportant\nIf other systems access your external metastore for this table schema or table properties, do not use this option. Keep\nenabled\nset to\ntrue\nto ensure they sync." +} \ No newline at end of file diff --git a/scraped_kb_articles/jobs-using-apache-spark-351-and-the-elasticsearch-hadoop-connector-failing-with-microbatchexecution-error.json b/scraped_kb_articles/jobs-using-apache-spark-351-and-the-elasticsearch-hadoop-connector-failing-with-microbatchexecution-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1d75e011cf8326eead8623ebbb7e56d2468c47ae --- /dev/null +++ b/scraped_kb_articles/jobs-using-apache-spark-351-and-the-elasticsearch-hadoop-connector-failing-with-microbatchexecution-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/jobs-using-apache-spark-351-and-the-elasticsearch-hadoop-connector-failing-with-microbatchexecution-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to use Apache Spark 3.5.1 and the Elasticsearch Hadoop connector in your Databricks migration efforts, you notice a compatibility issue on your Elasticsearch instances and your job fails with the following error.\nERROR streaming.MicroBatchExecution - Query reconquery [id = , runId = ] terminated with error\r\norg.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0 in stage 0.0 (TID 0) (executor 0): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/catalyst/encoders/ExpressionEncoder;\r\n    at 
org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.(EsStreamQueryWriter.scala:50)\r\n    at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink.$anonfun$addBatch$5(EsSparkSqlStreamingSink.scala:72)\r\n    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\r\n    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)\r\n    at org.apache.spark.scheduler.Task.run(Task.scala:141)\r\n    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)\r\n    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\r\n    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\r\n    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)\r\n    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)\r\n    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\r\n    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\r\n    at java.lang.Thread.run(Thread.java:750)\nCause\nThe Elasticsearch Hadoop connector is not compatible with Spark version 3.5.1. The issue is on the Elasticsearch side. For more information, refer to Elastic Project’s Github issue,\nES-hadoop is not compatible with spark 3.5.1 #2210\n.\nSolution\nDatabricks does not own or maintain the Elasticsearch Hadoop connector. To solve this issue:\nReach out to the Elastic team for assistance.\nUse Databricks Runtime 13.3 LTS instead, which has compatibility on Spark 3.4.1." 
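A cheap guard against hitting this incompatibility at runtime is to check the Spark version before wiring up the ES-Hadoop sink. The helper below is our own sketch (in a notebook, the version string would come from `spark.version`); it encodes only the compatibility facts stated above — works on Spark 3.4.x (e.g. Databricks Runtime 13.3 LTS), broken on 3.5.1.

```python
# Guard: refuse to use the ES-Hadoop connector on Spark 3.5.x.

def es_hadoop_supported(spark_version):
    major, minor = (int(x) for x in spark_version.split(".")[:2])
    # connector works up through Spark 3.4.x per the article
    return (major, minor) <= (3, 4)

print(es_hadoop_supported("3.4.1"))  # DBR 13.3 LTS
print(es_hadoop_supported("3.5.1"))  # the failing combination
```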
+} \ No newline at end of file diff --git a/scraped_kb_articles/join-operation-on-masked-column-using-masked-values-instead-of-unmasked-ones.json b/scraped_kb_articles/join-operation-on-masked-column-using-masked-values-instead-of-unmasked-ones.json new file mode 100644 index 0000000000000000000000000000000000000000..8af473b46aa3109b9fc6c43e7e583c2e7b14769b --- /dev/null +++ b/scraped_kb_articles/join-operation-on-masked-column-using-masked-values-instead-of-unmasked-ones.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/join-operation-on-masked-column-using-masked-values-instead-of-unmasked-ones", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you perform a JOIN operation on a masked column, you notice the system uses the masked values for the JOIN instead of the real (unmasked) values. You receive different results than what you expect.\nCause\nRow filters and column masks are applied immediately after the table rows are scanned in the query plan, before any further operations such as joins or aggregations. Consequently, joins on masked columns use the masked values.\nThis is expected behavior, designed to ensure sensitive data from masked columns cannot be inadvertently exposed during query operations.\nFor more information, review the\nColumn mask clause\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nIf you need to join on sensitive columns without exposing the real values, create a materialized view followed by dynamic view with column-level security or hash sensitive columns.\nCreate a materialized view\nFirst, create a\nmaterialized view\n(\nAWS\n|\nAzure\n|\nGCP\n) that joins tables using unmasked sensitive columns. 
This allows the join operation to occur without masking interfering with the join logic.\nThe following example code creates a materialized view that joins\n\nand\n\non the unmasked\nsensitive_id\ncolumn.\n%sql\r\nCREATE OR REPLACE MATERIALIZED VIEW \r\nAS SELECT\r\n a.unmasked_id,\r\n a.name,\r\n b.address\r\nFROM a\r\nINNER JOIN b\r\n ON a.unmasked_id = b.unmasked_id\nThen create a\ndynamic view\n(\nAWS\n|\nAzure\n|\nGCP\n) on top of the materialized view to enforce column-level security. This dynamic view should implement masking or redaction policies to ensure sensitive data is only visible to authorized users.\nThe following example code creates a dynamic view on top of the previously-created materialized view to enforce column-level security using Unity Catalog functions.\n%sql\r\nCREATE OR REPLACE VIEW  AS\r\nSELECT\r\n CASE\r\n   WHEN is_account_group_member('admin') THEN unmasked_id\r\n   ELSE '****'\r\n END AS unmasked_id,\r\n  name,\r\n  CASE\r\n   WHEN is_account_group_member('admin') THEN address\r\n   ELSE CONCAT('**** ', SUBSTRING(address, -6))\r\n END AS address\r\nFROM ;\nHash sensitive columns\nHash sensitive columns and perform joins using these hashed representations.\nFirst, hash sensitive columns during table creation. The following example code creates tables\n\nand\n\nwith hashed versions of the sensitive columns to protect the original data. For more information, review the\nhash\nfunction\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n%sql\r\n\r\n-- Table A with hashed IDs\r\nCREATE OR REPLACE TABLE  AS\r\nSELECT\r\n hash(unmasked_id) AS hashed_id,\r\n name\r\nFROM ;\r\n\r\n-- Table B with hashed IDs and addresses\r\nCREATE OR REPLACE TABLE  AS\r\nSELECT\r\n hash(unmasked_id) AS hashed_id,\r\n xxhash64(address) AS hashed_address\r\nFROM ;\nThen perform a\nJOIN\non the hashed columns from the previously-created tables, ensuring sensitive data remains protected throughout the process. 
The following code provides an example.\nSELECT\r\n a.hashed_id,\r\n a.name,\r\n b.hashed_address\r\nFROM a\r\nINNER JOIN b\r\n ON a.hashed_id = b.hashed_id;" +} \ No newline at end of file diff --git a/scraped_kb_articles/join-two-dataframes-duplicated-columns.json b/scraped_kb_articles/join-two-dataframes-duplicated-columns.json new file mode 100644 index 0000000000000000000000000000000000000000..d3b87b7fb54cd32cef6eda7b8197f69fdd0e493d --- /dev/null +++ b/scraped_kb_articles/join-two-dataframes-duplicated-columns.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/join-two-dataframes-duplicated-columns", + "title": "Título do Artigo Desconhecido", + "content": "If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns.\nJoin on columns\nIf you join on columns, you get duplicated columns.\nScala\n%scala\r\n\r\nval llist = Seq((\"bob\", \"2015-01-13\", 4), (\"alice\", \"2015-04-23\",10))\r\nval left = llist.toDF(\"name\",\"date\",\"duration\")\r\nval right = Seq((\"alice\", 100),(\"bob\", 23)).toDF(\"name\",\"upload\")\r\n\r\nval df = left.join(right, left.col(\"name\") === right.col(\"name\"))\nPython\n%python\r\n\r\nllist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]\r\nleft = spark.createDataFrame(llist, ['name','date','duration'])\r\nright = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])\r\n\r\ndf = left.join(right, left.name == right.name)\nSolution\nSpecify the join column as an array type or string.\nScala\n%scala\r\n\r\nval df = left.join(right, Seq(\"name\"))\n%scala\r\n\r\nval df = left.join(right, \"name\")\nPython\n%python\r\n\r\ndf = left.join(right, [\"name\"])\n%python\r\n\r\ndf = left.join(right, \"name\")\nR\nFirst register the DataFrames as 
tables.\n%python\r\n\r\nleft.createOrReplaceTempView(\"left_test_table\")\r\nright.createOrReplaceTempView(\"right_test_table\")\n%r\r\n\r\nlibrary(SparkR)\r\nsparkR.session()\r\nleft <- sql(\"SELECT * FROM left_test_table\")\r\nright <- sql(\"SELECT * FROM right_test_table\")\nThe above code results in duplicate columns. The following code does not.\n%r\r\n\r\nhead(drop(join(left, right, left$name == right$name), left$name))\nJoin DataFrames with duplicated columns notebook\nReview the\nJoin DataFrames with duplicated columns example notebook\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/jpn-char-external-metastore.json b/scraped_kb_articles/jpn-char-external-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..ada5302bed681319c9c742c254f592900643c636 --- /dev/null +++ b/scraped_kb_articles/jpn-char-external-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/jpn-char-external-metastore", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to use Japanese characters in your tables, but keep getting errors.\nCreate a table with the OPTIONS keyword\nOPTIONS\nprovides extra metadata to the table. 
You try creating a table with\nOPTIONS\nand specify the\ncharset\nas\nutf8mb4\n.\n%sql\r\n\r\nCREATE TABLE default.JPN_COLUMN_NAMES('作成年月' string\r\n,'計上年月' string\r\n,'所属コード' string\r\n,'生保代理店コード_8桁' string\r\n,'所属名' string\r\n)\r\nusing csv  OPTIONS (path \"/mnt/tabledata/testdata/\", header \"true\", delimiter \",\", inferSchema \"false\", ignoreLeadingWhiteSpace \"false\", ignoreTrailingWhiteSpace \"false\", multiLine \"true\", escape \"\\\"\" , charset \"utf8mb4\");\nThe result is an error.\nError in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?)\nCreate a table without the OPTIONS keyword\nYou try to create a table without using\nOPTIONS\n.\n%sql\r\n\r\nCREATE TABLE test.JPN_COLUMN_NAMES (`作成年月` string ,`計上年月` string) USING csv\r\ndescribe extended test.JPN_COLUMN_NAMES;\nThe table appears to be created, but the column names are shown as\n????\ninstead of using the specified Japanese characters.\nCreate a table with Hive table expression\nYou try creating a Hive format table and specify the charset as\nutf8mb4\n.\n%sql\r\n\r\nCREATE TABLE test.JPN_COLUMN_NAMES (`作成年月` string ,`計上年月` string)\r\n   ROW FORMAT SERDE \"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe\"\r\n   WITH SERDEPROPERTIES ( \"separatorChar\" = \",\",\r\n    \"quoteChar\" = \"\\\"\",\r\n    \"escapeChar\" = \"\\\\\",\r\n    \"serialization.encoding\"='utf8mb4')\r\n    TBLPROPERTIES ( 'store.charset'='utf8mb4',\r\n    'retrieve.charset'='utf8mb4');\nThe result is an error.\nCaused by: java.sql.SQLException: Incorrect string value: '\\xE4\\xBD\\x9C\\xE6\\x88\\x90...' for column 'COLUMN_NAME' at row 1\r\nQuery is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (6544,,'作成年月','string',0)\nCause\nWhen a table is created, an entry is updated in the Hive metastore. 
The Hive metastore is typically a MySQL database.\nWhen a new table is created, the names of the columns are inserted into the\nTABLE_PARAMS\ntable of the metastore.\nThe collation of\nPARAM_VALUE\nin\nTABLE_PARAMS\nis\nlatin1_bin\nand its\ncharset\nis\nlatin1\n.\n%scala\r\n\r\nexecuteQuery(\"\"\"SELECT TABLE_SCHEMA , TABLE_NAME , COLUMN_NAME , COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABLE_PARAMS' \"\"\")\nSolution\nlatin1\ndoes not have support for Japanese characters, but\nUTF-8\ndoes.\nYou need to use an external metastore with\nutf8_bin\nas the collation and\nutf8\nas the charset.\nAny MySQL database 5.6 or above can be used as a Hive metastore.\nFor this example, we are using MySQL 8.0.13-4.\nCreate an external Apache Hive metastore (\nAWS\n|\nAzure\n|\nGCP\n).\nCreate a database to instantiate the new metastore with default tables.\n%sql\r\n\r\ncreate database \nThe newly created tables can be explored in the external database objects browser or by using the show tables command.\n%sql\r\n\r\n-- Run in the metastore database.\r\nshow tables in \nCheck the collation information in MySQL at the table level.\n%sql\r\n\r\nSELECT TABLE_NAME,TABLE_TYPE,TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES where TABLE_TYPE like 'BASE%'\nCheck the collation information in MySQL at the column level.\n%sql\r\n\r\nSELECT TABLE_SCHEMA , TABLE_NAME , COLUMN_NAME , COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS\nChange the\ncharset\nfrom\nlatin1\nto\nUTF-8\n.\n%sql\r\n\r\n-- Run in the metastore database. 
All queries are compatible with MySQL.\r\n-- Change collation and charset across the database.\r\nALTER DATABASE CHARACTER SET utf8 COLLATE utf8_bin;\r\n-- Change collation and charset per table.\r\nALTER TABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;\r\n-- Change collation and charset at the column level.\r\nALTER TABLE MODIFY CHARACTER SET utf8 COLLATE utf8_bin;\nYou can now correctly view Japanese characters when you display the table." +} \ No newline at end of file diff --git a/scraped_kb_articles/json-reader-parses-value-as-null.json b/scraped_kb_articles/json-reader-parses-value-as-null.json new file mode 100644 index 0000000000000000000000000000000000000000..91a07942a5a7b780fbc98816a66ff2fb01a02750 --- /dev/null +++ b/scraped_kb_articles/json-reader-parses-value-as-null.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/json-reader-parses-value-as-null", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to read a JSON file.\nYou know the file has data in it, but the Apache Spark JSON reader is returning a\nnull\nvalue.\nExample code\nYou can use this example code to reproduce the problem.\nCreate a test JSON file in DBFS.\n%python\r\n\r\ndbutils.fs.rm(\"dbfs:/tmp/json/parse_test.txt\")\r\ndbutils.fs.put(\"dbfs:/tmp/json/parse_test.txt\",\r\n\"\"\"\r\n{\"data_flow\":{\"upstream\":[{\"$\":{\"source\":\"input\"},\"cloud_type\":\"\"},{\"$\":{\"source\":\"File\"},\"cloud_type\":{\"azure\":\"cloud platform\",\"aws\":\"cloud service\"}}]}}\r\n\"\"\")\nRead the JSON file.\n%python\r\n\r\njsontest = spark.read.option(\"inferSchema\",\"true\").json(\"dbfs:/tmp/json/parse_test.txt\")\r\ndisplay(jsontest)\nThe result is a null value.\nCause\nIn Spark 2.4 and below, the JSON parser allows empty strings. Only certain data types, such as\nIntegerType\nare treated as\nnull\nwhen empty.\nIn Spark 3.0 and above, the JSON parser does not allow empty strings. 
An exception is thrown for all data types, except\nBinaryType\nand\nStringType\n.\nFor more information, review the\nSpark SQL Migration Guide\n.\nExample code\nThe example JSON shows the error because the data has two identical classification fields.\nThe first\ncloud_type\nentry is an empty string. The second\ncloud_type\nentry has data.\n\"cloud_type\":\"\"\r\n\"cloud_type\":{\"azure\":\"cloud platform\",\"aws\":\"cloud service\"}\nBecause the JSON parser does not allow empty strings in Spark 3.0 and above, a\nnull\nvalue is returned as output.\nSolution\nSet the\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n) value\nspark.sql.legacy.json.allowEmptyString.enabled\nto\nTrue\n. This configures the Spark 3.0 JSON parser to allow empty strings.\nYou can set this configuration at the cluster level or the notebook level.\nExample code\n%python\r\n\r\nspark.conf.set(\"spark.sql.legacy.json.allowEmptyString.enabled\", True)\r\njsontest1 = spark.read.option(\"inferSchema\",\"true\").json(\"dbfs:/tmp/json/parse_test.txt\")\r\ndisplay(jsontest1)" +} \ No newline at end of file diff --git a/scraped_kb_articles/json-unicode.json b/scraped_kb_articles/json-unicode.json new file mode 100644 index 0000000000000000000000000000000000000000..d5f4ab1d6bdbf263d74f7a710e2a0b675b6e97f3 --- /dev/null +++ b/scraped_kb_articles/json-unicode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/json-unicode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nSpark job fails with an exception containing the message:\nInvalid UTF-32 character 0x1414141(above 10ffff)  at char #1, byte #7)\r\nAt org.apache.spark.sql.catalyst.json.JacksonParser.parse\nCause\nThe JSON data source reader is able to automatically detect encoding of input JSON files using\nBOM\nat the beginning of the files.\nHowever, BOM is not mandatory by Unicode standard and prohibited by\nRFC 7159\n.\nFor example, section 8.1 says,\n\"\nImplementations MUST NOT add a byte order mark to the 
beginning of a JSON text.\"\nAs a consequence, Spark is not always able to detect the charset correctly and read the JSON file.\nSolution\nTo solve the issue, disable the charset auto-detection mechanism and explicitly set the charset using the encoding option:\n%scala\r\n\r\n.option(\"encoding\", \"UTF-16LE\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/jvm-based-workloads-unable-to-access-unity-catalog-volumes.json b/scraped_kb_articles/jvm-based-workloads-unable-to-access-unity-catalog-volumes.json new file mode 100644 index 0000000000000000000000000000000000000000..af6e14d46927319020f12c04d02824268dc53ce9 --- /dev/null +++ b/scraped_kb_articles/jvm-based-workloads-unable-to-access-unity-catalog-volumes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/jvm-based-workloads-unable-to-access-unity-catalog-volumes", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nLow-level Resilient Distributed Dataset (RDD) API calls in your JVM-based workloads cannot access Unity Catalog volumes. You encounter errors in SSL initialization, dependency loading, or schema registry access.\nCause\nUnity Catalog locations, including volumes and external locations, are not compatible with RDDs in Apache Spark. The restrictions are in place to maintain the Unity Catalog-enforced security model and process isolation. Unity Catalog is designed to work primarily with higher-level Spark APIs and its governance model.\nSpark and other JVM processes can only access Unity Catalog volumes or workspace files using the readers and writers that support Unity Catalog. For more information, refer to the\nWork with files on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nThe appropriate solution depends on the workload. Two common approaches to try are:\nUse init scripts to copy necessary files to local storage, by copying the files from UC storage location to DBFS or direct cloud storage path. 
For more information, refer to the\nWhat are init scripts?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nConfigure libraries to use supported storage mechanisms, such as cloud storage paths. For details, refer to your respective documentation in the following sub-list.\nConnect to Amazon S3\n(\nAWS\n|\nAzure\n|\nGCP\n)\nConnect to Azure Data Lake Storage and Blob Storage\n(\nAWS\n|\nAzure\n|\nGCP\n)\nConnect to Google Cloud Storage\n(\nAWS\n|\nAzure\n|\nGCP\n)\nIf you find neither of these approaches works for you, contact Databricks Support to identify an alternative way forward.\nPreventive Measures\nTo avoid similar issues in the future, assign checkpoint locations to explicit storage location paths (instead of relying on a Unity Catalog volume) such as S3, Azure file paths, or Google Cloud GCS buckets." +} \ No newline at end of file diff --git a/scraped_kb_articles/kafka-client-term-offsetoutofrange.json b/scraped_kb_articles/kafka-client-term-offsetoutofrange.json new file mode 100644 index 0000000000000000000000000000000000000000..673cd71c1ff49a8b810c4f2972a7e35be604ed6e --- /dev/null +++ b/scraped_kb_articles/kafka-client-term-offsetoutofrange.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/kafka-client-term-offsetoutofrange", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an Apache Spark application that is trying to fetch messages from an Apache Kafka source when it is terminated with a\nkafkashaded.org.apache.kafka.clients.consumer.OffsetOutOfRangeException\nerror message.\nCause\nYour Spark application is trying to fetch expired data offsets from Kafka.\nWe generally see this in these two scenarios:\nScenario 1\nThe Spark application is terminated while processing data. When the Spark application is restarted, it tries to fetch data based on previously calculated data offsets. 
If any of the data offsets have expired during the time the Spark application was terminated, this issue can occur.\nScenario 2\nYour retention policy is set to a shorter time than the time required to process the batch. By the time the batch is done processing, some of the Kafka partition offsets have expired. The offsets are calculated for the next batch, and if there is a mismatch in the checkpoint metadata due to the expired offsets, this issue can occur.\nSolution\nScenario 1 - Option 1\nDelete the existing checkpoint before restarting the Spark application. A new checkpoint offset is created with the details of the newly fetched offset.\nThe downside to this approach is that some of the data may be missed, because the offsets have expired in Kafka.\nScenario 1 - Option 2\nIncrease the Kafka retention policy of the topic so that it is longer than the time the Spark application is offline.\nNo data is missed with this solution, because no offsets have expired before the Spark application is restarted.\nThere are two types of retention policies:\nTime based retention\n- This type of policy defines the amount of time to keep a log segment before it is automatically deleted. The default time based data retention window for all topics is seven days. You can review the Kafka documentation for\nlog.retention.hours\n,\nlog.retention.minutes\n, and\nlog.retention.ms\nfor more information.\nSize based retention\n- This type of policy defines the amount of data to retain in the log for each topic-partition. This limit is per-partition. This value is unlimited by default. You can review the Kafka documentation for\nlog.retention.bytes\nfor more information.\nInfo\nIf multiple retention policies are set, the more restrictive one controls. This can be overridden on a per topic basis.\nReview Kafka’s\nTopic-level configuration\nfor more information on how to set a per topic override.\nScenario 2 - Option 1\nIncrease the retention policy of the partition. 
This is accomplished in the same way as the solution for\nScenario 1 - Option 2\n.\nScenario 2 - Option 2\nIncrease the number of parallel workers by configuring\n.option(\"minPartitions\",)\nfor\nreadStream\n.\nThe option\nminPartitions\ndefines the minimum number of partitions to read from Kafka. By default, Spark uses a one-to-one mapping of Kafka topic partitions to Spark partitions when consuming data from Kafka. If you set the option\nminPartitions\nto a value greater than the number of your Kafka topic partitions, Spark separates the Kafka topic partitions into smaller pieces.\nThis option is recommended at times of data skew, peak loads, and if your stream is falling behind. Setting this value greater than the default results in the initialization of Kafka consumers at each trigger. This can impact performance if you use SSL when connecting to Kafka." +} \ No newline at end of file diff --git a/scraped_kb_articles/kafka-no-resolvable-bootstrap-urls.json b/scraped_kb_articles/kafka-no-resolvable-bootstrap-urls.json new file mode 100644 index 0000000000000000000000000000000000000000..9742beebda91999b07063a8fd4c5960e5e1e5885 --- /dev/null +++ b/scraped_kb_articles/kafka-no-resolvable-bootstrap-urls.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/kafka-no-resolvable-bootstrap-urls", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to read or write data to a Kafka stream when you get an error message.\nkafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer\r\n\r\nCaused by: kafkashaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers\nIf you are running a notebook, the error message appears in a notebook cell.\nIf you are running a JAR job, the error message appears in the cluster driver and worker logs (\nAWS\n|\nAzure\n|\nGCP\n).\nCause\nThis error message occurs when an invalid hostname or IP address is passed to 
the\nkafka.bootstrap.servers\nconfiguration in\nreadStream\n.\nThis means a Kafka bootstrap server is not running at the given hostname or IP address.\nSolution\nContact your Kafka admin to determine the correct hostname or IP address for the Kafka bootstrap servers in your environment.\nMake sure you use the correct hostname or IP address when you establish the connection between Kafka and your Apache Spark structured streaming application." +} \ No newline at end of file diff --git a/scraped_kb_articles/keyproviderexception-error-when-trying-to-create-an-external-table-on-an-external-schema-with-authentication-at-the-notebook-level.json b/scraped_kb_articles/keyproviderexception-error-when-trying-to-create-an-external-table-on-an-external-schema-with-authentication-at-the-notebook-level.json new file mode 100644 index 0000000000000000000000000000000000000000..2695f3cb3f6e83ad6b3c7a35bcb4e2153b02742a --- /dev/null +++ b/scraped_kb_articles/keyproviderexception-error-when-trying-to-create-an-external-table-on-an-external-schema-with-authentication-at-the-notebook-level.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/keyproviderexception-error-when-trying-to-create-an-external-table-on-an-external-schema-with-authentication-at-the-notebook-level", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re trying to use a notebook with a service principal for authentication to create an external table on an external schema in a Hive metastore, and receive an error.\n\"KeyProviderException: Failure to initialize configuration for storage account .dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.key\".\nCause\nYou have an invalid configuration value for\nfs.azure.account.key\n. Invalid configuration values occur when the service principal doesn’t have proper access to the Azure storage account.\nSolution\nSet up authorization at the cluster configuration level instead of the notebook level. 
Set up the Apache Spark configuration on the cluster as an init script. Create an init script with the following code and add it to the cluster configuration.\n```\r\n#!/bin/bash\r\ncat << EOF >> /databricks/driver/conf/00-custom-spark.conf\r\n[driver] {\r\n  \"spark.hadoop.fs.azure.account.auth.type..dfs.core.windows.net\" = \"OAuth\"\r\n  \"spark.hadoop.fs.azure.account.oauth.provider.type..dfs.core.windows.net\" = \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\"\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.id..dfs.core.windows.net\" = \"\"\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.secret..dfs.core.windows.net\" = \"\"\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net\" = \"https://login.microsoftonline.com//oauth2/token\"\r\n}\r\nEOF\r\n```\nIf you store your service principal credentials in Databricks secrets, modify the init script to use\nclientId\nand\nclientSec\nas environment variables pointing to the secrets instead. Add the environment variables to the cluster configuration.\n```\r\n#!/bin/bash\r\nclientId={{secrets//}}\r\nclientSec={{secrets//}}\r\ncat << EOF >> /databricks/driver/conf/00-custom-spark.conf\r\n[driver] {\r\n  \"spark.hadoop.fs.azure.account.auth.type..dfs.core.windows.net\" = \"OAuth\"\r\n  \"spark.hadoop.fs.azure.account.oauth.provider.type..dfs.core.windows.net\" = \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\"\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.id..dfs.core.windows.net\" = ${clientId}\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.secret..dfs.core.windows.net\" = ${clientSec}\r\n  \"spark.hadoop.fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net\" = \"https://login.microsoftonline.com//oauth2/token\"\r\n}\r\nEOF\r\n```" +} \ No newline at end of file diff --git a/scraped_kb_articles/kfold-cross-validation.json b/scraped_kb_articles/kfold-cross-validation.json new file mode 100644 index
0000000000000000000000000000000000000000..881be29ee4bfc4b27359a40abd42b530e84c3977 --- /dev/null +++ b/scraped_kb_articles/kfold-cross-validation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/kfold-cross-validation", + "title": "Título do Artigo Desconhecido", + "content": "Cross validation randomly splits the training data into a specified number of folds. To prevent data leakage where the same data shows up in multiple folds you can use groups.\nscikit-learn\nsupports group\nK-fold cross validation\nto ensure that the folds are distinct and non-overlapping.\nOn Spark you can use the\nspark-sklearn\nlibrary, which distributes tuning of\nscikit-learn models\n, to take advantage of this method. This example tunes a\nscikit-learn\nrandom forest model with the group k-fold method on Spark with a\ngrp\nvariable:\n%python\r\n\r\nfrom sklearn.ensemble import RandomForestClassifier\r\nfrom spark_sklearn import GridSearchCV\r\nfrom sklearn.model_selection import GroupKFold\r\nparam_grid = {\"max_depth\": [8, 12, None],\r\n              \"max_features\": [1, 3, 10],\r\n              \"min_samples_split\": [1, 3, 10],\r\n              \"min_samples_leaf\": [1, 3, 10],\r\n              \"bootstrap\": [True, False],\r\n              \"criterion\": [\"gini\", \"entropy\"],\r\n              \"n_estimators\": [20, 40, 80]}\r\ngroup_kfold = GroupKFold(n_splits=3)\r\ngs = GridSearchCV(sc, estimator = RandomForestClassifier(random_state=42), param_grid=param_grid, cv = group_kfold)\r\ngs.fit(X1, y1 ,grp)\nDelete\nInfo\nThe library that is used to run the grid search is called\nspark-sklearn\n, so you must pass in the Spark context (\nsc\nparameter) first.\nThe\nX1\nand\ny1\nparameters must be pandas DataFrames. This grid search option only works on data that fits on the driver." 
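The leakage-prevention property of group K-fold can be seen with a stdlib-only sketch. The helper below is a simplified stand-in for scikit-learn's GroupKFold, not its exact algorithm (it assigns groups round-robin rather than balancing fold sizes): because each group lands wholly in one fold, no group ever appears on both sides of a split.

```python
from collections import defaultdict

def group_kfold(groups, n_splits):
    """Assign each distinct group to exactly one fold, round-robin.

    Yields (train_indices, test_indices) per fold. Simplified stand-in
    for sklearn's GroupKFold, which additionally balances fold sizes.
    """
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds = [[] for _ in range(n_splits)]
    for k, g in enumerate(sorted(by_group)):
        folds[k % n_splits].extend(by_group[g])
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Toy grouping variable (e.g. the grp column above): 9 samples, 5 groups.
groups = ["a", "a", "b", "b", "c", "c", "d", "d", "e"]
for train, test in group_kfold(groups, n_splits=3):
    # No group straddles the train/test boundary, so near-duplicate
    # samples from one group cannot leak into the test fold.
    assert {groups[i] for i in train}.isdisjoint({groups[i] for i in test})
```

This is the property that makes the folds "distinct and non-overlapping" at the group level, independent of which library drives the grid search.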
+} \ No newline at end of file diff --git a/scraped_kb_articles/knn-model-pyfunc-modulenotfounderror.json b/scraped_kb_articles/knn-model-pyfunc-modulenotfounderror.json new file mode 100644 index 0000000000000000000000000000000000000000..57c30c6fa6a10856f863aa910197ee7187ca2df5 --- /dev/null +++ b/scraped_kb_articles/knn-model-pyfunc-modulenotfounderror.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/knn-model-pyfunc-modulenotfounderror", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have created a Sklearn model using\nKNeighborsClassifier\nand are using\npyfunc\nto run a prediction.\nFor example:\n%python\r\n\r\nimport mlflow.pyfunc\r\npyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type='string')\r\npredicted_df = merge.withColumn(\"prediction\", pyfunc_udf(*merge.columns[1:]))\r\npredicted_df.collect()\nThe prediction returns a\nModuleNotFoundError: No module named 'sklearn.neighbors._classification'\nerror message.\nThe prediction may also return a\nFileNotFoundError: [Errno 2] No usable temporary directory found\nerror message.\nCause\nWhen a KNN model is logged, all of the data points used for training are saved as part of the pickle file.\nIf the model is trained with millions of records, all of that data is added to the model, which can dramatically increase its size. A model trained on millions of records can easily total multiple GBs.\npyfunc\nattempts to load the entire model into the executor’s cache when running a prediction.\nIf the model is too big to fit into memory, it results in one of the above error messages.\nSolution\nYou should use a tree-based algorithm, such as\nRandom Forest\nor XGBoost, instead of KNN, or downsample the data used to train the KNN model.\nIf you have unbalanced data, try a sampling method like SMOTE when training a tree-based algorithm." 
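The size blow-up described in the Cause section is easy to reproduce with stdlib pickle alone. The toy classes below are illustrative stand-ins, not sklearn internals: an instance-based model must serialize every training row, while a parametric model serializes only its fitted parameters.

```python
import pickle

class KNNLikeModel:
    """Instance-based: keeps every training row, as a pickled KNN does."""
    def fit(self, rows):
        self.training_rows = rows  # the whole dataset rides along in the pickle
        return self

class ParametricModel:
    """Keeps only fitted parameters, whatever the training size."""
    def fit(self, rows):
        self.mean = sum(v for row in rows for v in row) / (len(rows) * len(rows[0]))
        return self

rows = [[float(i), float(i + 1)] for i in range(100_000)]
knn_bytes = len(pickle.dumps(KNNLikeModel().fit(rows)))
param_bytes = len(pickle.dumps(ParametricModel().fit(rows)))

print(knn_bytes > 100 * param_bytes)  # the instance-based pickle dwarfs the other
```

At millions of rows the same ratio turns a few-KB model into a multi-GB one, which is what overwhelms the executor when `pyfunc` loads it.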
+} \ No newline at end of file diff --git a/scraped_kb_articles/korean-characters-breaking-when-downloading-genie-results-as-csv.json b/scraped_kb_articles/korean-characters-breaking-when-downloading-genie-results-as-csv.json new file mode 100644 index 0000000000000000000000000000000000000000..2b98bd0feb41357520789e6b1102fd9b7db35da6 --- /dev/null +++ b/scraped_kb_articles/korean-characters-breaking-when-downloading-genie-results-as-csv.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/korean-characters-breaking-when-downloading-genie-results-as-csv", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Genie in Databricks to extract SQL query results and download them as CSV files, you notice Korean characters appear broken or garbled when the file is opened in Microsoft Excel.\nCause\nThe CSV files you download from Genie into Excel are encoded in UTF-8, but are missing the BOM (Byte Order Mark) encoding. Excel by default uses a different default encoding, leading to the misinterpretation of Korean characters.\nSolution\nOpen the CSV file from within Excel instead to set the correct encoding.\nLaunch Excel.\nNavigate to the\nData\ntab.\nClick on\nFrom Text/CSV\n.\nSelect the CSV file you downloaded from Genie.\nIn the data preview window, under\nFile Origin\n, select\n65001: Unicode (UTF-8)\n.\nEnsure the delimiter is correctly identified (usually a comma).\nClick\nLoad\nto import the data." 
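If re-importing through Excel's wizard every time is impractical, another option consistent with the cause above is to add the BOM yourself: re-writing the downloaded file with Python's utf-8-sig codec prepends the byte order mark that Excel auto-detects. A minimal sketch, with placeholder file names and made-up sample rows:

```python
import codecs
import os
import tempfile

tmp = tempfile.mkdtemp()
source = os.path.join(tmp, "genie_results.csv")      # placeholder: the Genie download
target = os.path.join(tmp, "genie_results_bom.csv")  # Excel-friendly copy

# Simulate the BOM-less UTF-8 CSV that was downloaded.
with open(source, "w", encoding="utf-8") as f:
    f.write("이름,도시\n김철수,서울\n")

# Re-write it with utf-8-sig, which prepends the UTF-8 BOM.
with open(source, encoding="utf-8") as src, \
     open(target, "w", encoding="utf-8-sig") as dst:
    dst.write(src.read())

with open(target, "rb") as f:
    print(f.read(3) == codecs.BOM_UTF8)  # the BOM Excel looks for is now present
```

Double-clicking the re-encoded copy then opens with Korean characters intact, with no import wizard needed.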
+} \ No newline at end of file diff --git a/scraped_kb_articles/last_altered-column-in-information_schema-not-reflecting-data-modifications.json b/scraped_kb_articles/last_altered-column-in-information_schema-not-reflecting-data-modifications.json new file mode 100644 index 0000000000000000000000000000000000000000..5bee19690de949009805efe5d76ccb396a76cd4e --- /dev/null +++ b/scraped_kb_articles/last_altered-column-in-information_schema-not-reflecting-data-modifications.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/last_altered-column-in-information_schema-not-reflecting-data-modifications", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to determine the last modified timestamp for a Delta table. You rely on the\nlast_altered\ncolumn in the\nsystem.information_schema.tables\n, but notice this column doesn't always reflect the correct timestamp when the table experiences updates or inserts.\nCause\nThe\nlast_altered\ncolumn in the\nsystem.information_schema.tables\ndisplays the timestamp when the table's structure was last modified (such as changes to the table schema, metadata, or table properties).\nIt does not track data modifications such as inserts, updates, or deletes, so does not show a timestamp for the table’s data changes.\nSolution\nTo accurately track when a Delta table’s structure or data was last modified, use the lastModified column returned by the\nDESCRIBE DETAIL\ncommand.\n%sql\r\nDESCRIBE DETAIL ..;\nDESCRIBE DETAIL\nreturns comprehensive table information, including the timestamp that reflects the most recent modification to the table's data or structure." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/launch-the-web-terminal-on-a-dcs-enabled-cluster.json b/scraped_kb_articles/launch-the-web-terminal-on-a-dcs-enabled-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..745fece0f4eee9963b4f587d1e92675bfdd554ed --- /dev/null +++ b/scraped_kb_articles/launch-the-web-terminal-on-a-dcs-enabled-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/launch-the-web-terminal-on-a-dcs-enabled-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe Databricks web terminal offers an interactive CLI for running shell commands. The web terminal service is proxied on port 7681 on the cluster’s Apache Spark driver. Databricks Container Services (DCS) lets you specify a Docker image when you create a cluster.\nHowever, you notice when you use a custom Docker image, the web terminal is disabled by default.\nCause\nEnabling DCS will disable the web terminal on the cluster. Access to the Workspace File System is not permitted from the web terminal.\nSolution\nWarning\nThe following method is a custom workaround intended to launch the web terminal on a DCS-enabled cluster. It is provided as-is and is not covered under any formal Service Level Agreement (SLA).\nTo access the web terminal on a DCS cluster, you can use the\nttyd\npackage in your Dockerfile and launch it over port 7681.\nttyd\nis a lightweight, web-based terminal emulator.\nAdd the\nttyd\npackage to your Dockerfile. 
This is a sample Dockerfile that installs the package along with its dependencies.\n# Use the databricksruntime/standard:11.3-LTS image from as a sample base image\r\nFROM databricksruntime/standard:11.3-LTS\r\n\r\n# Install required dependencies\r\nRUN apt-get update && apt-get install -y \\\r\n   openssh-server gnupg2 build-essential cmake wget git zlib1g-dev libuv1-dev \\\r\n   libwebsockets-dev libssl-dev libjson-c-dev acl \\\r\n   && apt-get clean \\\r\n   && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*\r\n\r\n# Download and compile ttyd from source\r\nRUN wget https://github.com/tsl0922/ttyd/archive/refs/tags/1.6.3.tar.gz \\\r\n   && tar -xvf 1.6.3.tar.gz \\\r\n   && cd ttyd-1.6.3 \\\r\n   && mkdir build \\\r\n   && cd build \\\r\n   && cmake .. \\\r\n   && make \\\r\n   && make install \\\r\n   && cd ../.. \\\r\n   && rm -rf ttyd-1.6.3 1.6.3.tar.gz\r\n\r\n# Set the default command to run ttyd with /bin/bash\r\nCMD [\"ttyd\", \"/bin/bash\"]\nNavigate to your workspace's home folder.\nClick\nCreate\nand select\nFile\n.\nName the file\nweb-terminal.sh\nand add the following shell script contents.\n#!/bin/bash\r\n\r\n# Wait for the cluster to come up\r\nsleep 30\r\n\r\n# Run ttyd in the background\r\nttyd -p 7681 bash &\nConfigure the cluster accordingly with the Docker image URL. This can be done using the UI or the API.\nTo launch from the UI:\nNavigate to the\nCreate\ncompute page, under\nAccess mode\nselect\nDedicated\n.\nSpecify a Databricks Runtime version that supports DCS.\nUnder\nAdvanced options\n, select the\nDocker\ntab.\nIn the\nDocker image URL\nfield, enter your custom Docker image with the appropriate authentication type.\nUnder the\nInit Scripts\n>\nSource\ntab, select\nWorkspace\nand provide the file path to the init script created in the previous section.\nStart the cluster. After the cluster spins up in the UI, navigate to the\nApps\ntab, click the\nWeb Terminal\nbutton. 
This launches the web terminal.\nTo use the API, refer to the\nCreate new cluster\n(\nAWS\n|\nAzure\n) API documentation.\nFor more information, refer to the\nCustomize containers with Databricks Container Service\n(\nAWS\n|\nAzure\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/learn-about-apache-hive-metastore-costs.json b/scraped_kb_articles/learn-about-apache-hive-metastore-costs.json new file mode 100644 index 0000000000000000000000000000000000000000..b6a1379c067060f7a5f00d8f1abefaa782741bc7 --- /dev/null +++ b/scraped_kb_articles/learn-about-apache-hive-metastore-costs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/learn-about-apache-hive-metastore-costs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using the standard workspace and want to know if there are any fees or additional costs for using the Apache Hive metastore.\nCause\nThe Hive metastore itself does not incur any direct costs. Any associated costs are related to actual storage, compute usage, and network traffic.\nSolution\nWhile the Hive metastore itself does not incur direct costs, customers need to be aware of the associated costs that arise from its usage. These costs primarily fall into three categories: compute costs, network costs, and storage costs.\nCompute costs\nCompute costs depend on the size and number of clusters used to process the data stored in the Hive metastore. Queries and transformations involving large datasets or frequent table updates can significantly impact compute costs.\nUse the Databricks\nPrice Calculator\nto estimate cluster usage and costs.\nNetwork costs\nNetwork costs are associated with data transfers between the Databricks workspace, Hive metastore, and storage locations. 
High volumes of data movement can lead to additional egress and ingress costs, particularly for cross-region or multi-cloud deployments.\nWork with your cloud networking team to assess data transfer costs.\nMinimize unnecessary data movement by designing efficient data pipelines within the same regions.\nStorage costs\nThe storage costs depend on the amount of data stored in the Hive metastore-related storage. The default location for managed tables in the Hive metastore on Databricks is the Databricks File System (DBFS) root. To prevent unintended storage usage by end users who create managed tables in hive metastore, it is recommended to specify an external storage location when creating databases in the Hive metastore.\nTrack storage usage and associated costs in the cloud provider’s console (\nAWS Console\n,\nAzure Portal\n, or\nGoogle Cloud Console\n).\nSet up alerts and budget limits within the cloud provider to avoid unexpected costs." +} \ No newline at end of file diff --git a/scraped_kb_articles/left-join-resulting-in-null-values-when-joining-timestamp-column-and-date-column.json b/scraped_kb_articles/left-join-resulting-in-null-values-when-joining-timestamp-column-and-date-column.json new file mode 100644 index 0000000000000000000000000000000000000000..1256f807517aa1ad8aa369a8c109173e4d070c7e --- /dev/null +++ b/scraped_kb_articles/left-join-resulting-in-null-values-when-joining-timestamp-column-and-date-column.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/left-join-resulting-in-null-values-when-joining-timestamp-column-and-date-column", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen joining two dataframes, joining a timestamp column with a date column results in null values.\nExample\nIn this example,\nstart_timestamp\nis of timestamp data type, and\nstart_date\nis of date data type.\nselect * from table1\r\nleft join table2\r\non\r\ntable1.start_timestamp = table2.start_date\nCause\nA join between a 
timestamp and a date column will produce non-null results only if the time in the timestamp column is 00:00:00 UTC.\nAdditionally, if\nspark.sql.session.timeZone\nis set to a timezone other than UTC, 00:00:00 UTC is converted to the time as per the set timezone, leading to null results during the join.\nSolution\nCast the value of the timestamp column to date datatype when joining it with a column of 'date' datatype.\nselect * from table1 t1\r\nleft join table2 t2\r\non\r\nto_date(t1.start_timestamp, 'yyyy-MM-dd') = t2.start_date" +} \ No newline at end of file diff --git a/scraped_kb_articles/legacy-global-init-script-migration-notebook.json b/scraped_kb_articles/legacy-global-init-script-migration-notebook.json new file mode 100644 index 0000000000000000000000000000000000000000..f84dab15d763aec13bc6562e662d7200a79aaa60 --- /dev/null +++ b/scraped_kb_articles/legacy-global-init-script-migration-notebook.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/legacy-global-init-script-migration-notebook", + "title": "Título do Artigo Desconhecido", + "content": "On Dec 1, 2023, Databricks will disable legacy global init scripts for all workspaces. This type of init script was deprecated in 2020 and will not be usable after Dec 1, 2023. The legacy global init script was replaced in 2020 by the more reliable current global init script framework, which continues to be supported.\nDatabricks recommends that you migrate your legacy global init scripts to the current global init script framework as soon as possible.\nYou can follow the documentation to manually\nMigrate from legacy to new global init scripts\n(\nAWS\n|\nAzure\n). Alternatively, Databricks Engineering has created a notebook to help automate the migration process.\nDelete\nInfo\nThis article does not apply to GCP workspaces. 
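The timestamp/date join behavior described in the previous article can be illustrated in plain Python — a sketch of the cause, not Spark itself, with made-up sample values (Spark implicitly widens the DATE side to a TIMESTAMP at midnight before comparing):

```python
from datetime import datetime, date, time

# Spark widens the DATE side of the join to a TIMESTAMP at 00:00:00 before
# comparing, so only midnight timestamps ever match (sketch of the cause).
ts = datetime(2024, 5, 1, 13, 45)        # non-midnight start_timestamp
d = date(2024, 5, 1)                     # start_date
widened = datetime.combine(d, time.min)  # 2024-05-01 00:00:00

print(ts == widened)   # False -> the join key never matches, producing NULLs
print(ts.date() == d)  # True  -> casting the timestamp to a date fixes the join
```

This is why casting the timestamp column down to a date, rather than relying on the implicit widening, makes the join succeed.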
Legacy global init scripts were never available on GCP workspaces.\nInstructions\nDelete\nWarning\nYou must be a Databricks admin to run this migration notebook.\nPrerequisites\nBefore running the migration notebook, you need to have the scope name and secret name for your Personal Access Token.\nFor more information, please review the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n) and the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n) documentation.\nDo a dry run\nExecuting a dry run allows you to test the migration notebook in your workspace without making any changes.\nDownload the\nMigrate legacy global init scripts notebook\n.\nImport the notebook to your workspace.\nStart a cluster.\nRun the notebook.\nA UI screen appears after you run the notebook. Enter the\nScope Name\nand\nSecret Name\ninto the appropriate fields.\nAfter updating the settings, run the notebook a second time.\nIf there are no errors, you are ready to migrate your legacy global init scripts.\nMigrate your legacy global init scripts\nYou should always do a dry run before making changes to your system.\nRun the Migrate legacy global init scripts notebook.\nA UI screen appears after you run the notebook. Enter the\nScope Name\nand\nSecret Name\ninto the appropriate fields.\nIn the\nDry Run\ndrop down menu, select\nFalse\n. This allows the notebook to make changes to your workspace.\nIn the\nEnable New Scripts\ndrop down menu, select\nTrue\n. This enables the current global init script framework on your workspace. Select\nFalse\nif you want to migrate the scripts, but not enable them.\nIn the\nRemove Old Scripts\ndrop down menu, select\nTrue\n. This removes the legacy global init scripts.\nIn the\nRevert Changes\ndrop down menu, select\nFalse\n. 
This should only be set to\nTrue\nif you need to undo changes for a specific reason.\nAfter updating the settings, run the notebook a second time.\nOnce the notebook finishes running, all of your legacy global init scripts are migrated.\nValidate the migrated scripts\nOpen the\nAdmin Console\n.\nClick\nGlobal init scripts\n.\nVerify that all of your legacy global init scripts are present.\nToggle the\nEnabled\nswitch to turn individual scripts on or off.\nClick\nEdit ordering\nto change the order of the scripts.\nRestart your clusters and verify that they run as expected." +} \ No newline at end of file diff --git a/scraped_kb_articles/libraries-failing-with-owner-or-network-errors-on-databricks-runtime-133-lts-current-153.json b/scraped_kb_articles/libraries-failing-with-owner-or-network-errors-on-databricks-runtime-133-lts-current-153.json new file mode 100644 index 0000000000000000000000000000000000000000..5b6c8fe04107ce4e985bd32e28e16d9a3f5d6b96 --- /dev/null +++ b/scraped_kb_articles/libraries-failing-with-owner-or-network-errors-on-databricks-runtime-133-lts-current-153.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/libraries-failing-with-owner-or-network-errors-on-databricks-runtime-133-lts-current-153", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen migrating to Databricks Runtime 13.3 LTS to current (15.3), libraries start failing with owner or network related errors.\nExample\nLibrary installation attempted on the driver node of cluster XXXX-XXXXXX-XXXXXXXX and failed. Please refer to the following error message to fix the library or contact Databricks support. Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. 
Error Message: org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 'petastorm==0.12.0' --disable-pip-version-check) exited with code 1. WARNING: The directory '/home/libraries/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\nCause\nThe Python library installation index defaults to\nhttps://pypi.org/simple.\nDue to security enhancements, on Databricks Runtime 13.3 LTS to current (15.3), libraries are installed as a non-root user. For more information, please review the\nCluster-scoped Python libraries are installed using a non-root user\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you have set a global and/or cluster-scoped init script to exchange the default index for a custom repository (for example, pointing to an artifact one), this index will not be applicable to the new user if you do not set it as a\nglobal index\n.\nSolution\nInstalled via init script\nAdjust your init script index so it uses the\n--global\nflag and points to your custom index URL.\n/databricks/python/bin/pip config --global set global.index-url \nInstalled via workspace UI\nOpen the cluster properties and click on\nLibraries\n.\nSelect a library, or click\nInstall new\nto install a new library.\nSet the custom index URL in the Index\nURL field\n.\nNote\nIf installing via the workspace UI, you must individually set the custom index URL for every library that requires one." 
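As a sketch of the init-script approach above, the snippet below generates a script that sets the pip index with `--global`; the index URL and output path are placeholders, not values from the article:

```python
# Generate a cluster-scoped init script that sets the pip index globally, so
# the non-root 'libraries' user inherits it. INDEX_URL and the output path
# are assumptions for illustration only.
INDEX_URL = "https://artifacts.example.com/simple"

script = (
    "#!/bin/bash\n"
    f"/databricks/python/bin/pip config --global set global.index-url {INDEX_URL}\n"
)

with open("/tmp/set-pip-index.sh", "w") as f:
    f.write(script)

print(script)
```

Upload the resulting file as a cluster-scoped init script so the setting is applied before any library installation runs.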
+} \ No newline at end of file diff --git a/scraped_kb_articles/library-fail-dependency-exception.json b/scraped_kb_articles/library-fail-dependency-exception.json new file mode 100644 index 0000000000000000000000000000000000000000..47a0918a5ab4a38143de4936f656c439bad2eea6 --- /dev/null +++ b/scraped_kb_articles/library-fail-dependency-exception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/library-fail-dependency-exception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a Python function that is defined in a custom egg or wheel file and also has dependencies that are satisfied by another custom package installed on the cluster.\nWhen you call this function, it returns an error that says the requirement cannot be satisfied.\norg.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-d82b31df-1da3-4ee9-864d-8d1fce09c09b/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-d82b31df-1da3-4ee9-864d-8d1fce09c09b/bin/pip, install, fractal==0.1.0, --disable-pip-version-check) exited with code 1. Could not find a version that satisfies the requirement fractal==0.1.0 (from versions: 0.1.1, 0.1.2, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0)\nAs an example, imagine that you have both wheel A and wheel B installed, either to the cluster via the UI or via notebook-scoped libraries. Assume that wheel A has a dependency on wheel B.\ndbutils.library.install(/path_to_wheel/A.whl)\ndbutils.library.install(/path_to_wheel/B.whl)\nWhen you try to make a call using one of these libraries, you get a requirement cannot be satisfied error.\nCause\nEven though the requirements have been met by installing the required dependencies via the cluster UI or via a notebook-scoped library installation, Databricks cannot guarantee the order in which specific libraries are installed on the cluster. 
If a library is being referenced and it has not been distributed to the executor nodes, it will fallback to PyPI and use it locally to satisfy the requirement.\nSolution\nYou should use one egg or wheel file that contains all required code and dependencies. This ensures that your code has the correct libraries loaded and available at run time." +} \ No newline at end of file diff --git a/scraped_kb_articles/library-fail-transient-maven.json b/scraped_kb_articles/library-fail-transient-maven.json new file mode 100644 index 0000000000000000000000000000000000000000..387d37b80cc7a56e1ec5c7f1513a4c029193f3a1 --- /dev/null +++ b/scraped_kb_articles/library-fail-transient-maven.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/library-fail-transient-maven", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nJob fails because libraries cannot be installed.\nLibrary resolution failed. Cause: java.lang.RuntimeException: Cannot download some libraries due to transient Maven issue. Please try again later\nCause\nAfter a Databricks upgrade, your cluster attempts to download any required libraries from Maven. After downloading, the libraries are stored as Workspace libraries. The next time you run a cluster that requires the libraries, they are loaded as Workspace libraries.\nThis behavior is due to legacy code.\nSolution\nThere is no workaround for this issue.\nThis only happens after a maintenance window, and only if Maven is unavailable at the time. It is a rare corner case." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/library-install-latency.json b/scraped_kb_articles/library-install-latency.json new file mode 100644 index 0000000000000000000000000000000000000000..fd7066bcad576a83485eeca31e44ac5298fa50dc --- /dev/null +++ b/scraped_kb_articles/library-install-latency.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/library-install-latency", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are launching jobs that import external libraries and get an Import Error.\nWhen a job causes a node to restart, the job fails with the following error message:\nImportError: No module named XXX\nCause\nThe Cluster Manager is part of the Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R libraries when it restarts each node. Sometimes, library installation or downloading of artifacts from the internet can take more time than expected. This occurs due to network latency, or it occurs if the library that is being attached to the cluster has many dependent libraries.\nThe library installation mechanism guarantees that when a notebook attaches to a cluster, it can import installed libraries. When library installation through PyPI takes excessive time, the notebook attaches to the cluster before the library installation completes. In this case, the notebook is unable to import the library.\nSolution\nMethod 1\nUse notebook-scoped library installation commands in the notebook. You can enter the following commands in one cell, which ensures that all of the specified libraries are installed.\n%python\r\n\r\ndbutils.library.installPyPI(\"mlflow\")\r\ndbutils.library.restartPython()\nMethod 2\nAWS\nTo avoid delay in downloading the libraries from the internet repositories, you can cache the libraries in DBFS or S3.\nFor example, you can download the wheel or egg file for a Python library to a DBFS or S3 location. 
You can use the REST API or cluster-scoped init scripts to install libraries from DBFS or S3.\nFirst, download the wheel or egg file from the internet to the DBFS or S3 location. This can be performed in a notebook as follows:\nDelete\nAzure\nTo avoid delay in downloading the libraries from the internet repositories, you can cache the libraries in DBFS or Azure Blob Storage.\nFor example, you can download the wheel or egg file for a Python library to a DBFS or Azure Blob Storage location. You can use the REST API or cluster-scoped init scripts to install libraries from DBFS or Azure Blob Storage.\nFirst, download the wheel or egg file from the internet to the DBFS or Azure Blob Storage location. This can be performed in a notebook as follows:\nDelete\n%sh\r\n\r\ncd /dbfs/mnt/library\r\nwget \nAfter the wheel or egg file download completes, you can install the library to the cluster using the REST API, UI, or init script commands." +} \ No newline at end of file diff --git a/scraped_kb_articles/library-installation-attempted-on-the-driver-node-of-the-cluster-failed.json b/scraped_kb_articles/library-installation-attempted-on-the-driver-node-of-the-cluster-failed.json new file mode 100644 index 0000000000000000000000000000000000000000..854740f6972728f261c61b0f369e5a21fe871a15 --- /dev/null +++ b/scraped_kb_articles/library-installation-attempted-on-the-driver-node-of-the-cluster-failed.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/library-installation-attempted-on-the-driver-node-of-the-cluster-failed", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to install libraries from Unity Catalog (UC) volumes on a shared cluster, the process fails with the following error.\nLibrary installation attempted on the driver node of the cluster failed. \r\nPlease contact Databricks support to resolve the issue. Error Code: INVALID_ACCESS_TOKEN_ERROR. 
Error Message: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: 403: Invalid access token.\nCause\nThe original creator of the cluster no longer has workspace access, and in UC-shared clusters, Databricks validates library installation requests using the user’s identity token. As a result, any actions tied to the former user’s identity fail, including library installations.\nSolution\nUninstall the problematic library from the cluster where the issue occurred.\nHave the new user (the one currently managing the cluster) reinstall the library.\nThis ensures the library installation is performed using the current user’s identity token, which has valid permissions." +} \ No newline at end of file diff --git a/scraped_kb_articles/list-all-available-tables-and-their-source-formats-in-unity-catalog.json b/scraped_kb_articles/list-all-available-tables-and-their-source-formats-in-unity-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..e1d08893c252fe6f3ced7234553e8d13de2b35cf --- /dev/null +++ b/scraped_kb_articles/list-all-available-tables-and-their-source-formats-in-unity-catalog.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/list-all-available-tables-and-their-source-formats-in-unity-catalog", + "title": "Título do Artigo Desconhecido", + "content": "You may want to get a list of all the Delta tables and non-Delta tables available in your Unity Catalog instance.\nYou can use these sample SQL queries to get a table names and the corresponding data source format.\nInstructions\nInfo\nMake sure you have permission to access Unity Catalog. 
You will not be able to view information on tables if you don't have the appropriate permission.\nMake sure your cluster has permission to access Unity Catalog.\nUse this sample SQL query to get a list of all the available Delta tables in your Unity Catalog.\n%sql\r\n\r\nSELECT table_name, data_source_format \r\nFROM system.information_schema.tables where data_source_format = \"DELTA\";\nUse this sample SQL query to get a list of all the available tables and their source formats in your Unity Catalog.\n%sql\r\n\r\nSELECT table_name, data_source_format \r\nFROM system.information_schema.tables;" +} \ No newline at end of file diff --git a/scraped_kb_articles/list-all-workspace-objects.json b/scraped_kb_articles/list-all-workspace-objects.json new file mode 100644 index 0000000000000000000000000000000000000000..2d76479d807360d8d2aeac8a61207182f9d04e98 --- /dev/null +++ b/scraped_kb_articles/list-all-workspace-objects.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/list-all-workspace-objects", + "title": "Título do Artigo Desconhecido", + "content": "You can use the Databricks Workspace API (\nAWS\n|\nAzure\n|\nGCP\n) to recursively list all workspace objects under a given path.\nCommon use cases for this include:\nIndexing all notebook names and types for all users in your workspace.\nUse the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks.\nDynamically get the absolute path of a notebook under a given user, and submit that to the Databricks Jobs API to trigger notebook-based jobs (\nAWS\n|\nAzure\n|\nGCP\n).\nDefine function\nThis example code defines the function and the logic needed to run it.\nYou should place this code at the beginning of your notebook.\nYou need to replace\n\nwith your personal access token (\nAWS\n|\nAzure\n|\nGCP\n).\n%python\r\n\r\nimport requests\r\nimport json\r\nfrom ast import literal_eval\r\n\r\n# Authorization\r\nheaders = {\r\n  'Authorization': 'Bearer 
',\r\n}\r\n\r\n# Define rec_req as a function.\r\n# Note: Default path is \"/\" which scans all users and folders.\r\n\r\ndef rec_req(instanceName,loc=\"/\"):\r\n data_path = '{{\"path\": \"{0}\"}}'.format(loc)\r\n instance = instanceName\r\n url = '{}/api/2.0/workspace/list'.format(instance)\r\n response = requests.get(url, headers=headers, data=data_path)\r\n # Raise exception if a directory or URL does not exist.\r\n response.raise_for_status()\r\n jsonResponse = response.json()\r\n for i,result in jsonResponse.items():\r\n   for value in result:\r\n    dump = json.dumps(value)\r\n    data = literal_eval(dump)\r\n    if data['object_type'] == 'DIRECTORY':\r\n     # Iterate through all folders.\r\n     rec_req(instanceName,data['path'])\r\n    elif data['object_type'] == 'NOTEBOOK':\r\n     # Return the notebook path.\r\n     print(data)\r\n    else:\r\n     # Skip imported libraries.\r\n     pass\nRun function\nOnce you have defined the function in your notebook, you can call it at any time.\nYou need to replace\n\nwith the instance name (\nAWS\n|\nAzure\n|\nGCP\n) of your Databricks deployment. This is typically the URL, without any workspace ID.\nYou need to replace\n\nwith the full path you want to search. This is typically\n/\n.\n%python\r\n\r\n\r\nrec_req(\"https://\", \"\")\nDelete\nInfo\nYou should NOT include a trailing\n/\nas the last character of the instance name. The function generates an error if a trailing\n/\nis included." 
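The trailing-slash caveat above can be made robust with a small normalization step. This is a hypothetical helper, not part of the article's `rec_req` function:

```python
def build_list_url(instance_name: str) -> str:
    # Strip any trailing '/' so the joined URL never contains '//api/...',
    # which is what makes the function error when a trailing slash is passed.
    return f"{instance_name.rstrip('/')}/api/2.0/workspace/list"

print(build_list_url("https://example.cloud.databricks.com/"))
# -> https://example.cloud.databricks.com/api/2.0/workspace/list
```

Applying a normalization like this before calling the Workspace API makes the function tolerant of either input form.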
+} \ No newline at end of file diff --git a/scraped_kb_articles/list-tables.json b/scraped_kb_articles/list-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..1fb69dec5f51771edb0a7ade8fe85f4c8e6e0db1 --- /dev/null +++ b/scraped_kb_articles/list-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/list-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nTo fetch all the table names from metastore you can use either\nspark.catalog.listTables()\nor\n%sql show tables\n. If you observe the duration to fetch the details you can see\nspark.catalog.listTables()\nusually takes longer than\n%sql show tables\n.\nCause\nspark.catalog.listTables()\ntries to fetch every table’s metadata first and then show the requested table names. This process is slow when dealing with complex schemas and larger numbers of tables.\nSolution\nTo get only the table names, use\n%sql show tables\nwhich internally invokes\nSessionCatalog.listTables\nwhich fetches only the table names." +} \ No newline at end of file diff --git a/scraped_kb_articles/list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json b/scraped_kb_articles/list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json new file mode 100644 index 0000000000000000000000000000000000000000..dd99befc1a9760e6af89b6f86eded0ec43a034b8 --- /dev/null +++ b/scraped_kb_articles/list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/list-users-who-executed-select-command-against-a-list-of-schemas-using-system-access-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou wish to track the usage history (select queries) for specific tables in Databricks. 
You may encounter difficulties in identifying which users have executed\nSELECT\ncommands against arbitrary schemas/tables.\nCause\nYou may not be familiar with writing the SQL queries needed to access system tables information.\nSolution\nThe audit logs can be used to identify users who have performed\ngetTable\nactions (i.e.\nSELECT\ncommands) against the specified tables. The audit logs are stored in the\nsystem.access.audit\ntable, which can be queried using Apache Spark SQL.\nAdd the desired date range to the query\nevent_date\nparameter and include the full path to the schemas you wish to audit (i.e.\ncatalog.schema\n).\n%sql\r\n\r\nSELECT \r\n event_time,\r\n event_date,\r\n user_identity.email AS user_email,\r\n service_name,\r\n action_name,\r\n request_params,\r\n source_ip_address,\r\n user_agent\r\nFROM system.access.audit\r\nWHERE action_name = 'getTable'\r\nAND (\r\n request_params['full_name_arg'] LIKE '%%'\r\n OR request_params['full_name_arg'] LIKE '%%'\r\n OR request_params['full_name_arg'] LIKE '%%'\r\n)\r\nAND event_date BETWEEN '' AND ''\r\nORDER BY event_time DESC;\nThe query output contains all tables that have been selected which are part of the schemas added. The\nrequest_params\nfield contains the table name, and\nuser_identity\nis the user that performed the action. If there is no output, it means no users have selected the tables under the schemas you included in the query." 
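For auditing many schemas, the query above can be generated programmatically. This is a string-building sketch with a hypothetical helper name; the schema and date values are illustrative placeholders:

```python
def build_audit_query(schemas, start_date, end_date):
    # Hypothetical helper: expand a list of 'catalog.schema' prefixes into
    # the LIKE clauses of the audit query shown above.
    likes = " OR ".join(
        f"request_params['full_name_arg'] LIKE '{s}.%'" for s in schemas
    )
    return (
        "SELECT event_time, user_identity.email AS user_email, action_name, "
        "request_params FROM system.access.audit "
        "WHERE action_name = 'getTable' "
        f"AND ({likes}) "
        f"AND event_date BETWEEN '{start_date}' AND '{end_date}' "
        "ORDER BY event_time DESC"
    )

print(build_audit_query(["main.sales", "main.hr"], "2024-01-01", "2024-01-31"))
```

The returned string can then be executed with `spark.sql(...)` on a Unity Catalog-enabled cluster.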
+} \ No newline at end of file diff --git a/scraped_kb_articles/listing-hive-metastore-tables-in-catalog-explorer-failing-with-error-getting-schemas.json b/scraped_kb_articles/listing-hive-metastore-tables-in-catalog-explorer-failing-with-error-getting-schemas.json new file mode 100644 index 0000000000000000000000000000000000000000..4c6a2d5c2064f1a1d6136ae5e297f4b312b079e7 --- /dev/null +++ b/scraped_kb_articles/listing-hive-metastore-tables-in-catalog-explorer-failing-with-error-getting-schemas.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/listing-hive-metastore-tables-in-catalog-explorer-failing-with-error-getting-schemas", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to list your Hive metastore tables in Catalog Explorer using a Unity Catalog-enabled cluster, you receive the following error.\nError getting schemas\r\nsummary: SparkException: Process List(/bin/su, spark-) exited with code 1. \",\"\\tat org.apache.spark.util.Utils$.executeAndGetOutputInternal(Utils.scala:1433)\",\"\\tat org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1367)\",\"\\tat\nCause\nYou have the Apache Spark configuration\nspark.databricks.session.share\nset to\ntrue\n.\nSolution\nRemove the Spark configuration\nspark.databricks.session.share true\n.\nWhen set to true, this configuration helps share a single Spark session across different notebooks so views created in one notebook can be used from another. However, the configuration is outdated and it’s best to avoid using it." 
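A quick way to spot and drop the outdated setting is to filter the cluster's Spark configuration before updating the cluster definition. The conf dict below is illustrative, shaped like the `spark_conf` field returned by the Clusters API:

```python
# Illustrative spark_conf as returned by the Clusters API (values assumed).
spark_conf = {
    "spark.databricks.session.share": "true",
    "spark.sql.shuffle.partitions": "200",
}

# Drop the outdated key; send the remaining conf back in the cluster edit request.
cleaned = {k: v for k, v in spark_conf.items()
           if k != "spark.databricks.session.share"}

print(cleaned)
```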
+} \ No newline at end of file diff --git a/scraped_kb_articles/loading-models-using-mlflow-causes-typeerror-around-unexpected-number-of-arguments.json b/scraped_kb_articles/loading-models-using-mlflow-causes-typeerror-around-unexpected-number-of-arguments.json new file mode 100644 index 0000000000000000000000000000000000000000..76e3af788782b68e49c8f1ea503e5b84edfdcc0e --- /dev/null +++ b/scraped_kb_articles/loading-models-using-mlflow-causes-typeerror-around-unexpected-number-of-arguments.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/loading-models-using-mlflow-causes-typeerror-around-unexpected-number-of-arguments", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a registered MLflow model in your workspace. You want to generate predictions on a PySpark DataFrame and save the results to a table. You use MLflow in a notebook on an all-purpose cluster to load the model as an Apache Spark user-defined function (UDF) and encounter the following error.\nCode that produces the error\nloaded_model = mlflow.pyfunc.spark_udf(spark, model_uri= model_uri, result_type=\"float\")\r\n\r\ndf_result = df.withColumn('predictions', loaded_model(struct(*map(col, df.columns))))\r\n\r\ndf_result.write.mode('overwrite').option(\"mergeSchema\", \"true\").saveAsTable('')\nStack trace for the error\nThe cluster used to log the model has a different Databricks Runtime version (15.4 LTS ML) than the one used to load the model (14.3 LTS ML).\nPythonException: \r\n  An exception was thrown from the Python worker. 
Please see the stack trace below.\r\nTraceback (most recent call last):\r\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXX/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py\", line 1914, in udf\r\n    loaded_model = mlflow.pyfunc.load_model(local_model_path)\r\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXX/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py\", line 857, in load_model\r\n    except ModuleNotFoundError as e:\r\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXX/lib/python3.10/site-packages/mlflow/pyfunc/model.py\", line 468, in _load_pyfunc\r\n    context, python_model, signature = _load_context_model_and_signature(model_path, model_config)\r\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXX/lib/python3.10/site-packages/mlflow/pyfunc/model.py\", line 450, in _load_context_model_and_signature\r\n    python_model = cloudpickle.load(f)\r\nTypeError: code expected at most 16 arguments, got 18\nNote\nThe error may vary slightly depending on the Databricks Runtimes involved, but the following cause and solution are applicable to similar scenarios.\nCause\nThe cloudpickle package and Python version you used to log and register the model differ from the versions present in the Databricks Runtime used to load the model.\nCloudpickle can only be used to send objects between the exact same version of Python. 
For details, refer to the\ncloudpickle documentation\n.\nSolution\nWhen you load the model using MLflow to generate single predictions or as a Spark UDF for batch inference, ensure that you’re using a cluster with a Databricks Runtime that matches the Python and cloudpickle versions used to log and register the model.\nTo check for correspondence, either run the\n!python --version\ncommand in a notebook cell or run the following code snippet in a notebook cell.\nimport sys\r\nprint(sys.version)" +} \ No newline at end of file diff --git a/scraped_kb_articles/log-delivery-feature-not-generating-log4j-logs-for-executor-folders.json b/scraped_kb_articles/log-delivery-feature-not-generating-log4j-logs-for-executor-folders.json new file mode 100644 index 0000000000000000000000000000000000000000..f209097d73b3a69ec01ff039152a7d7ad48e4221 --- /dev/null +++ b/scraped_kb_articles/log-delivery-feature-not-generating-log4j-logs-for-executor-folders.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/log-delivery-feature-not-generating-log4j-logs-for-executor-folders", + "title": "Título do Artigo Desconhecido", + "content": "Problem:\nWhen using the log delivery feature to ship job/cluster logs to an S3 bucket, the driver folder has a\nlog4j\nlog file but none of the executor folders have it.\nSolution:\nThere is no issue with the log delivery feature. It is designed to generate a\nlog4j\nlog file only for the driver folder and not for the executor folders. This behavior is by design and does not indicate any problem with the feature or the cluster.\nThe\nlog4j\nlog file in the driver folder contains the logs specific to the driver node, while the executor folders contain logs specific to the executor nodes.\nIf you need to access the logs from the executor nodes, you can use other logging mechanisms such as Apache Spark's internal logging or custom loggers within your application code." 
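The custom-logger option mentioned above can be sketched as follows; the function and logger names are illustrative, and the commented usage assumes an active SparkSession:

```python
# Sketch of a custom executor-side logger: each task logs from the worker it
# runs on, so output appears in executor stdout/stderr rather than in a
# driver-side log4j file.
def log_partition(idx, rows):
    import logging
    log = logging.getLogger("my-app.executor")
    log.warning("processing partition %d", idx)
    yield from rows

# Usage on a cluster (assumes an active SparkSession named `spark`):
# spark.sparkContext.parallelize(range(100)) \
#      .mapPartitionsWithIndex(log_partition).count()
```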
+} \ No newline at end of file diff --git a/scraped_kb_articles/logging-a-model-with-mlflow-in-a-pyspark-pipeline-throws-a-tempdir-class-assertion-error-.json b/scraped_kb_articles/logging-a-model-with-mlflow-in-a-pyspark-pipeline-throws-a-tempdir-class-assertion-error-.json new file mode 100644 index 0000000000000000000000000000000000000000..2ef5733fa3ca806ea49acf0e57217a5ca79f8421 --- /dev/null +++ b/scraped_kb_articles/logging-a-model-with-mlflow-in-a-pyspark-pipeline-throws-a-tempdir-class-assertion-error-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/logging-a-model-with-mlflow-in-a-pyspark-pipeline-throws-a-tempdir-class-assertion-error-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to log a model with MLflow in a PySpark pipeline, you encounter an assertion error related to the TempDir class in MLflow.\nAn error occurred during model logging:\r\n %s Traceback (most recent call last):\r\n File \"/databricks/python/lib/python3.10/site-packages/retail_sales_data_product/training/transform.py\", line 266, in log_model\r\n    mlflow.xgboost.log_model(\r\n File \"/databricks/python/lib/python3.10/site-packages/mlflow/xgboost/__init__.py\", line 270, in log_model\r\n    return Model.log(\r\n File \"/databricks/python/lib/python3.10/site-packages/mlflow/models/model.py\", line 620, in log\r\n    with TempDir() as tmp:\r\n File \"/databricks/python/lib/python3.10/site-packages/mlflow/utils/file_utils.py\", line 426, in __exit__\r\n    assert os.path.exists(os.getcwd())\r\nAssertionError\r\nEnding MLflow run\nCause\nMLflow is attempting to verify the current working directory’s existence but the working directory has become invalid.\nSolution\nUpgrade your MLflow version to 2.16.0 or higher.\nAlternatively, you can upgrade your Databricks runtime to version 13.3 LTS or above, which comes with the latest version of MLflow.\nFor more detail on pre-installed library versions in Databricks 
Runtime, please refer to the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/losing-data-while-migrating-delta-table-between-workspaces.json b/scraped_kb_articles/losing-data-while-migrating-delta-table-between-workspaces.json new file mode 100644 index 0000000000000000000000000000000000000000..b4b1c07344485dfe0ce2fcdb8613f4f2ae5fc560 --- /dev/null +++ b/scraped_kb_articles/losing-data-while-migrating-delta-table-between-workspaces.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/losing-data-while-migrating-delta-table-between-workspaces", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen migrating a Delta table between workspaces, multiple users in an organization copy the data to migrate. You subsequently notice file losses and failures that corrupt the Delta table.\nCause\nConnectivity or network issues can interrupt manual copying of Terabytes of data. These issues lead to data losses and failures.\nSolution\nFirst, set up Delta Sharing for the Delta table from which you are migrating the data. Follow the instructions in the\nWhat is Delta Sharing?\n(\nAWS\n|\nAzure\n) documentation.\nAfter setting up Delta Sharing, clone the shared Delta table. Follow the instructions in the\nClone a table on Databricks\n(\nAWS\n|\nAzure\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/lost-workers-due-to-autoscaling-events.json b/scraped_kb_articles/lost-workers-due-to-autoscaling-events.json new file mode 100644 index 0000000000000000000000000000000000000000..985f26210c1aca4ca09e86a9ddd90d7f05524545 --- /dev/null +++ b/scraped_kb_articles/lost-workers-due-to-autoscaling-events.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/lost-workers-due-to-autoscaling-events", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are reviewing your cluster driver logs and you see\nTaskSchedulerImpl: Lost executor\nerrors, even though there are no signals of high resource utilization, networking issues, or any other issue.\nTaskSchedulerImpl: Lost executor 101 on 10.162.54.201: worker lost: 10.162.54.201:39561 got disassociated\nCause\nThis message may show up in the cluster driver logs when a worker gets terminated due to a downscaling event, manual termination, or spot instance termination.\nSolution\nThe error message can be safely ignored if the worker is lost due to a downscaling event, manual termination, or spot instance termination." +} \ No newline at end of file diff --git a/scraped_kb_articles/manage-size-delta-table.json b/scraped_kb_articles/manage-size-delta-table.json new file mode 100644 index 0000000000000000000000000000000000000000..ef80158c73031c18c57fff161df92a88d5080b72 --- /dev/null +++ b/scraped_kb_articles/manage-size-delta-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/manage-size-delta-table", + "title": "Título do Artigo Desconhecido", + "content": "Delta tables are different than traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. 
These additional features require storage space.\nIn this article, we discuss recommendations that can help you manage the size of your Delta tables.\nEnable file system versioning\nWhen you enable file system versioning, you keep multiple variants of your data in the same storage bucket. The file system creates versions of your data instead of deleting items, which increases the storage space used by your Delta table.\nEnable bloom filters\nA Bloom filter index (\nAWS\n|\nAzure\n|\nGCP\n) is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. Databricks supports file-level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file, Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter.\nThe size of a Bloom filter depends on the number of elements in the set for which the Bloom filter has been created and the required false positive probability (FPP). The lower the FPP, the higher the number of bits used per element and the more accurate the filter will be, at the cost of more storage space.\nReview your Delta\nlogRetentionDuration\npolicy\nLog files are retained for 30 days by default. This value is configurable through the delta.logRetentionDuration property. You can set a value for this property with the\nALTER TABLE SET TBLPROPERTIES\nSQL method. The more days you retain, the more storage space you consume. For example, if you set\ndelta.logRetentionDuration = '365 days'\nit keeps the log files for 365 days instead of the default of 30 days.\nVACUUM\nyour Delta table\nVACUUM (\nAWS\n|\nAzure\n|\nGCP\n) removes data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. 
Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days. Databricks does not automatically trigger\nVACUUM\noperations on Delta tables. You must run this command manually.\nVACUUM\nhelps you delete obsolete files that are no longer needed.\nOPTIMIZE\nyour Delta table\nThe OPTIMIZE (\nAWS\n|\nAzure\n|\nGCP\n) command compacts multiple Delta files into large single files. This improves the overall query speed and performance of your Delta table by helping you avoid having too many small files around. By default,\nOPTIMIZE\ncreates 1GB files." +} \ No newline at end of file diff --git a/scraped_kb_articles/match-parquet-schema.json b/scraped_kb_articles/match-parquet-schema.json new file mode 100644 index 0000000000000000000000000000000000000000..e78c029dbac478944f55310f3513a0aedc761356 --- /dev/null +++ b/scraped_kb_articles/match-parquet-schema.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/match-parquet-schema", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nLet’s say you have a large list of essentially independent Parquet files, with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don’t match.\nOne solution could be to read the files in sequence, identify the schema, and union the\nDataFrames\ntogether. However, this approach is impractical when there are hundreds of thousands of files.\nSolution\nSet the Apache Spark property\nspark.sql.files.ignoreCorruptFiles\nto\ntrue\nand then read the files with the desired schema. Files that don’t match the specified schema are ignored. 
The resultant dataset contains only data from those files that match the specified schema.\nSet the Spark property using\nspark.conf.set\n:\nspark.conf.set(\"spark.sql.files.ignoreCorruptFiles\", \"true\")\nAlternatively, you can set this property in your\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/maven-libraries-start-failing-with-timed-out-errors-when-updating-to-databricks-runtime-113-lts-153-current.json b/scraped_kb_articles/maven-libraries-start-failing-with-timed-out-errors-when-updating-to-databricks-runtime-113-lts-153-current.json new file mode 100644 index 0000000000000000000000000000000000000000..71bbd2c67080a35f80345a610444a4a8af7ce08a --- /dev/null +++ b/scraped_kb_articles/maven-libraries-start-failing-with-timed-out-errors-when-updating-to-databricks-runtime-113-lts-153-current.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/maven-libraries-start-failing-with-timed-out-errors-when-updating-to-databricks-runtime-113-lts-153-current", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen updating Databricks Runtime from previous versions (9.x - 11.x) to any of 11.3 LTS - 15.3 (current), the Maven Libraries start failing with connection timed-out issues while connecting to the repository.\nExample\nServer access error at url https://repo1.maven.org/maven2/com/microsoft/azure/azure-eventhubs-spark_2.12/2.3.22/azure-eventhubs-spark_2.12-2.3.22.pom (java.net.ConnectException: Connection timed out (Connection timed out))\nCause\nAs of Databricks Runtime 11.x, Maven libraries now resolve in your compute plane by default when you install libraries on a cluster. 
Your cluster\nmust\nhave access to\nMaven Central\n.\nTo review the change notes in the documentation, please see\nDatabricks Runtime 11.0 release notes\n(\nAWS\n|\nAzure\n|\nGCP\n).\nSolution\nWhitelist\nMaven Central\nand the\nnew Maven repo\nfor your cluster to work with this feature.\nIf needed, you can revert your cluster to the previous behavior using the configuration\nspark.databricks.libraries.enableMavenResolution false\nFor more information, please review the Apache Spark settings in the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAdditionally, you may also wish to whitelist the following Maven repos:\nrepos.spark-packages.org\nrepo1.maven.org\nrepo.maven.apache.org\nmaven-central.storage-download.googleapis.com\nIf the issue persists, discard any proxy script that can disrupt Databricks Runtime’s connections to the Maven repositories." +} \ No newline at end of file diff --git a/scraped_kb_articles/maven-library-version-mgmt.json b/scraped_kb_articles/maven-library-version-mgmt.json new file mode 100644 index 0000000000000000000000000000000000000000..35fb2af640e20cb0ca4559dc8c0e403b821062d7 --- /dev/null +++ b/scraped_kb_articles/maven-library-version-mgmt.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/maven-library-version-mgmt", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou make a minor update to a library in the repository, but you don’t want to change the version number because it is a small change for testing purposes. When you attach the library to your cluster again, your code changes are not included in the library.\nCause\nOne strength of Databricks is the ability to install third-party or custom libraries, such as from a Maven repository. 
However, when a library is updated in the repository, there is no automated way to update the corresponding library in the cluster.\nWhen you request Databricks to download a library in order to attach it to a cluster, the following process occurs:\nIn Databricks, you request a library from a Maven repository.\nDatabricks checks the local cache for the library, and if it is not present, downloads the library from the Maven repository to a local cache.\nDatabricks then copies the library to DBFS (\n/FileStore/jars/maven/\n).\nUpon subsequent requests for the library, Databricks uses the file that has already been copied to DBFS, and does not download a new copy.\nSolution\nTo ensure that an updated version of a library (or a library that you have customized) is downloaded to a cluster, make sure to increment the build number or version number of the artifact in some way. For example, you can change\nlibA_v1.0.0-SNAPSHOT\nto\nlibA_v1.0.1-SNAPSHOT\n, and then the new library will download. You can then attach it to your cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/max-query-on-delta-tables-performs-slowly-on-string-or-timestamp-columns.json b/scraped_kb_articles/max-query-on-delta-tables-performs-slowly-on-string-or-timestamp-columns.json new file mode 100644 index 0000000000000000000000000000000000000000..42f2882bc3e903edba47f2e450bf4c1f2bea8cf9 --- /dev/null +++ b/scraped_kb_articles/max-query-on-delta-tables-performs-slowly-on-string-or-timestamp-columns.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/max-query-on-delta-tables-performs-slowly-on-string-or-timestamp-columns", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou observe queries using the\nMAX()\nfunction on Delta tables with string or timestamp columns perform slowly.\nCause\nMetadata optimizations are not applied for string or timestamp data types by default, leading to full table scans instead of leveraging Delta Lake statistics for performance improvements.\nAdditional context\nDelta Lake uses collected statistics to optimize queries by skipping data based on min/max values. However, for string and timestamp columns, Delta collects statistics only on the first 32 characters (for strings), which may not represent the full range of values in those columns.\nConsequently, the Delta Lake optimizer skips these columns when evaluating whether to perform metadata-based query optimizations (like data skipping or statistics pruning).\nSolution\nEnable metadata query optimizations for these column types by setting the following Apache Spark configuration. Run the following code in a notebook.\nSET spark.databricks.delta.optimizeMetadataQuery.clusteredTable.enabled = TRUE;\nThis setting allows the Delta optimizer to use file-level metadata for string or timestamp columns when the table is clustered, even though full statistics are not available." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/maximum-execution-context.json b/scraped_kb_articles/maximum-execution-context.json new file mode 100644 index 0000000000000000000000000000000000000000..3d20a6468377d57aa530b72d05ffbfe8b9e601c5 --- /dev/null +++ b/scraped_kb_articles/maximum-execution-context.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/maximum-execution-context", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nNotebook or job execution stops and returns either of the following errors:\nRun result unavailable: job failed with error message\r\nContext ExecutionContextId(1731742567765160237) is disconnected.\nCan’t attach this notebook because the cluster has reached the attached notebook limit. Detach a notebook and retry.\nCause\nWhen you attach a notebook to a cluster, Databricks creates an execution context (\nAWS\n|\nAzure\n). If there are too many notebooks attached to a cluster or too many jobs are created, at some point the cluster reaches its maximum threshold limit of 145 execution contexts, and Databricks returns an error.\nSolution\nConfigure context auto-eviction (\nAWS\n|\nAzure\n), which allows Databricks to remove (evict) idle execution contexts. Additionally, from the pipeline and ETL design perspective, you can avoid this issue by using:\nFewer notebooks to reduce the number of execution contexts that are created.\nA job cluster instead of an interactive cluster. If the use case permits, submit notebooks or jars as jobs." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/members-of-a-gmail-group-email-not-receiving-notifications-.json b/scraped_kb_articles/members-of-a-gmail-group-email-not-receiving-notifications-.json new file mode 100644 index 0000000000000000000000000000000000000000..5f9f7a8ae9ac17b890faf4fe880e203ed0a64611 --- /dev/null +++ b/scraped_kb_articles/members-of-a-gmail-group-email-not-receiving-notifications-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/members-of-a-gmail-group-email-not-receiving-notifications-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using Gmail group notifications for job executions, the set of users belonging to the group do not receive the notifications. Other email clients are not affected.\nCause\nThe group’s incoming email setup does not permit messages from outside the organization.\nSolution\nAllow external entities to email the group inbox.\nLog into your Google Workspace Admin console.\nGo to\nApps / Google Workspace / Groups for Business\n/\nSharing settings.\nCheck\nGroup owners can allow incoming email from outside the organization.\nClick\nSave.\nIf your group already exists, then you also need to make a change within the specific group’s settings.\nNavigate to\nGoogle groups\nand click the group that needs to receive notifications.\nClick on\nGroup settings\nand scroll down the page.\nChange ‘Who can post’ to ‘Anyone on the web’ and click\nSave changes." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/metastore-admin-group-was-accidentally-deleted-no-access-to-the-owned-catalogs-or-underlying-data.json b/scraped_kb_articles/metastore-admin-group-was-accidentally-deleted-no-access-to-the-owned-catalogs-or-underlying-data.json new file mode 100644 index 0000000000000000000000000000000000000000..ac5e63a1a69227bf67e8c100cedab3a4de14c7ac --- /dev/null +++ b/scraped_kb_articles/metastore-admin-group-was-accidentally-deleted-no-access-to-the-owned-catalogs-or-underlying-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/metastore-admin-group-was-accidentally-deleted-no-access-to-the-owned-catalogs-or-underlying-data", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour metastore admin group was accidentally deleted from your Databricks account. The same group owned your Unity Catalog and underlying schemas. As a result, you lost access to the catalogs and schemas.\nCause\nOnce a group is deleted, it cannot be restored. Any associated permissions for the group and/or its users are removed. 
If the metastore admin entity is deleted, access to the catalogs and schemas is permanently lost.\nSolution\nIf the group was fully deleted (from both the Databricks account console and from your SCIM provider), you must recreate it.\nIf the group was only deleted from the Databricks account console,\nsync the desired group from your SCIM provider\n(\nAWS\n|\nAzure\n|\nGCP\n).\nAfter the group has been recreated (or synced):\nAssign the group as the metastore admin in the Databricks account console.\nAdd yourself to the group.\nRun\nALTER CATALOG\ncommand to update the ownership of the desired catalogs under the metastore to the newly created/synced metastore admin group.\nAdd the desired users to the group.\nInfo\nIf you are using Microsoft Entra ID as your SCIM provider and the users in the group are not synced with the initial SCIM sync, you can\nrequest an immediate sync\n(\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/migration-guidance-for-init-scripts-on-dbfs.json b/scraped_kb_articles/migration-guidance-for-init-scripts-on-dbfs.json new file mode 100644 index 0000000000000000000000000000000000000000..28b00b61450c467c672f8971defcb4e0d72cc8fd --- /dev/null +++ b/scraped_kb_articles/migration-guidance-for-init-scripts-on-dbfs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/migration-guidance-for-init-scripts-on-dbfs", + "title": "Título do Artigo Desconhecido", + "content": "This article provides migration guidance for init scripts on DBFS.\nDetect End-of-Life init scripts\nYou can detect all init scripts stored on DBFS in your workspace by running the\nDBFS init script detection notebook\n. 
After you have identified any init scripts on DBFS, you should migrate them to supported storage.\nMigrate End-of-Life init scripts:\nThe\nrecommended migration path\n(\nAWS\n|\nAzure\n|\nGCP\n) for your init scripts depends on the init script type and the Databricks Runtime version you plan on using.\nLegacy global init scripts on any Databricks Runtime version\nIf you are still using legacy global init scripts, you should immediately\nmigrate them to compute policies\n(\nAWS\n|\nAzure\n) or\nnew global init scripts\n(\nAWS\n|\nAzure\n).\nThis notebook\ncan be used to automate the migration work to new global init scripts.\nCluster-named and cluster-scoped init scripts\nDatabricks Runtime 13.3 LTS and above (Unity Catalog clusters with single-user or shared access mode)\nAll\ncluster-named init scripts should be disabled\n(\nAWS\n|\nAzure\n) and\nconverted to cluster-scoped init scripts\n(\nAWS\n|\nAzure\n).\nMigrate your cluster-scoped init scripts from DBFS to\nUnity Catalog volumes\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nmigration guides\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nDatabricks Runtime 11.3 LTS and above\nAll\ncluster-named init scripts should be disabled\n(\nAWS\n|\nAzure\n) and\nconverted to cluster-scoped init scripts\n(\nAWS\n|\nAzure\n).\nMigrate your cluster-scoped init scripts from DBFS to\nworkspace files\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nmigration guides\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nDatabricks Runtime 10.4 LTS and below\nAll\ncluster-named init scripts should be disabled\n(\nAWS\n|\nAzure\n) and\nconverted to cluster-scoped init scripts\n(\nAWS\n|\nAzure\n).\nMigrate your cluster-scoped init scripts from DBFS to\ncloud object storage\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nmigration guides\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nUnsupported Databricks Runtime\nRunning an\nunsupported Databricks Runtime\n(\nAWS\n|\nAzure\n|\nGCP\n) is NOT recommended. 
You should immediately upgrade your clusters to a\nsupported Databricks Runtime version\n(\nAWS\n|\nAzure\n|\nGCP\n).\nInit script documentation\nFor more information on init scripts, please review the\ninit script documentation\n(\nAWS\n|\nAzure\n|\nGCP\n).\nhttps://docs.databricks.com/en/_extras/documents/aws-init-volumes.pdf" +} \ No newline at end of file diff --git a/scraped_kb_articles/missing-the-audit-log-event-of-a-cluster-deletion.json b/scraped_kb_articles/missing-the-audit-log-event-of-a-cluster-deletion.json new file mode 100644 index 0000000000000000000000000000000000000000..75c46d47b089bf1c6f6451151799107f6a8a3843 --- /dev/null +++ b/scraped_kb_articles/missing-the-audit-log-event-of-a-cluster-deletion.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/missing-the-audit-log-event-of-a-cluster-deletion", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you query the audit logs or system table\nsystem.compute.clusters\n, you see a delete time for a given cluster, indicating it was deleted. When you follow up by checking the audit log events for the cluster ID, however, you don’t see a corresponding record matching the deleted cluster event.\nCause\nWhen a cluster is inactive for a certain period (typically 30 days), it is automatically deleted to free up resources. Databricks does not audit clusters that are deleted due to inactivity in the audit logs, so no corresponding event appears.\nSolution\nPin your cluster to avoid automatic deletion due to inactivity. You can use either the Databricks API or the workspace UI to do so. For details, refer to the\nManage compute\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you don’t want to pin your cluster, monitor cluster usage to ensure it isn’t left idle for extended periods. This can help prevent automatic deletion due to inactivity." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/missing-trace-level-terraform-log-files-to-troubleshoot-template-issue.json b/scraped_kb_articles/missing-trace-level-terraform-log-files-to-troubleshoot-template-issue.json new file mode 100644 index 0000000000000000000000000000000000000000..4f735d36189099f34b9f1ce4a07bd7bcf532ba2c --- /dev/null +++ b/scraped_kb_articles/missing-trace-level-terraform-log-files-to-troubleshoot-template-issue.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/missing-trace-level-terraform-log-files-to-troubleshoot-template-issue", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to get support on a Terraform template issue and want to have detailed log files available to make troubleshooting easier.\nCause\nTerraform does not save\nTRACE\nlevel log files by default.\nSolution\nTo enable Terraform logs you must configure the\nTF_LOG\nand\nTF_LOG_PATH\nenvironment variables before running the\nterraform init\n,\nterraform plan\n, and\nterraform apply\ncommands.\nDepending on your environment the setup of these variables may be different. For example this format is used for a bash console:\nexport TF_LOG=\"TRACE\"\r\nexport TF_LOG_PATH=\"tmp/terraform.log\"\nAfter logging has been enabled, reproduce the issue and then download the log file from the log path.\nYou can review the Terraform log file to troubleshoot the issue. If you need additional help, open a support ticket and attach the log file to the ticket.\nFor more information, review the\nDebugging Terraform\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-api-429-errors-when-transitioning-models.json b/scraped_kb_articles/mlflow-api-429-errors-when-transitioning-models.json new file mode 100644 index 0000000000000000000000000000000000000000..84a7175557d8310fd14ea22eec6529e4cd41a050 --- /dev/null +++ b/scraped_kb_articles/mlflow-api-429-errors-when-transitioning-models.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-api-429-errors-when-transitioning-models", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are downloading artifacts from models when you get an API error message. The error message indicates that the API request to list artifacts for a specific model version has failed, due to too many 429 error responses.\nMax retries exceeded with url: /api/2.0/mlflow/model-versions/list-artifacts?name=model_name&version=version_number&path=some/path (Caused by ResponseError('too many 429 error responses'))\")', 'some/path/logger.py': 'MlflowException(\"API request to https:///api/2.0/mlflow/model-versions/list-artifacts failed with exception HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /api/2.0/mlflow/model-versions/list-artifacts?name=model_name&version=version_number&path=some/path (Caused by ResponseError('too many 429 error responses'))\")'\nCause\nThe rate limit for the MLflow Workspace Model Registry API is set to 40 queries per second, per workspace. When the rate limit is exceeded, the API returns a 429 error response. 
This error can occur when multiple jobs or processes are attempting to download artifacts from the same model version simultaneously, causing the rate limit to be exceeded.\nFor more information, review the\nResource limits\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nTo work around this issue, you can set timeout and retry environment variables in your job clusters.\nClick\nWorkflows\nin the left navigation bar.\nClick the name of the job you want to edit.\nClick the\nEdit\nicon (looks like a pencil) in the\nCluster\nfield.\nClick\nAdvanced options\nto expand the section.\nAdd the following lines to the\nEnvironment variables\nfield:\nMLFLOW_HTTP_REQUEST_TIMEOUT=360\nMLFLOW_HTTP_REQUEST_BACKOFF_FACTOR=5\nMLFLOW_HTTP_REQUEST_MAX_RETRIES=8\nClick\nConfirm\n.\nRestart your job.\nThe additional environment variables space out requests on the\n/api/2.0/mlflow/model-versions/list-artifacts\nendpoint that is hitting the rate limit.\nMLFLOW_HTTP_REQUEST_TIMEOUT\nsets the maximum time in seconds to wait for a request to complete.\nMLFLOW_HTTP_REQUEST_BACKOFF_FACTOR\nsets the backoff factor to apply between retry attempts.\nMLFLOW_HTTP_REQUEST_MAX_RETRIES\nsets the maximum number of retries to attempt before giving up.\nIn addition to setting these environment variables, you can also consider these best practices to avoid hitting the rate limit:\nLimit the number of concurrent jobs or processes that are accessing the same model version.\nUse versioning to create new versions of models instead of modifying the same version.\nUse the MLflow API to list artifacts for a model version instead of downloading them directly.\nFor more information, review the\nMLflow API\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-custom-storage.json b/scraped_kb_articles/mlflow-artifacts-custom-storage.json new file mode 100644 index 0000000000000000000000000000000000000000..76a653c5de6a5755d12859e6ab711136b1a748c5 --- /dev/null +++ b/scraped_kb_articles/mlflow-artifacts-custom-storage.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-custom-storage", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you create an MLflow experiment with a custom artifact location, you get the following warning:\nCause\nMLflow experiment permissions (\nAWS\n|\nAzure\n|\nGCP\n) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files.\nMLflow cannot guarantee the enforcement of access controls on artifacts stored in custom locations.\nSolution\nDatabricks recommends using the default artifact location when creating an MLflow experiment.\nThe default storage location is backed by access controls." +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-download.json b/scraped_kb_articles/mlflow-artifacts-download.json new file mode 100644 index 0000000000000000000000000000000000000000..4f0585e7cab1811d12294f3d2eb4550384f53a63 --- /dev/null +++ b/scraped_kb_articles/mlflow-artifacts-download.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-download", + "title": "Título do Artigo Desconhecido", + "content": "By default, the MLflow client saves artifacts to an artifact store URI during an experiment. 
The artifact store URI is similar to\n/dbfs/databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/\n.\nThis artifact store is an MLflow-managed location, so you cannot download artifacts directly.\nYou must use\nclient.download_artifacts\nin the MLflow client to copy artifacts from the artifact store to another storage location.\nExample code\nThis example code downloads the MLflow artifacts from a specific run and stores them in the location specified as\nlocal_dir\n.\nReplace\n<local_dir>\nwith the local path where you want to store the artifacts.\nReplace\n<run_id>\nwith the\nrun_id\nof your specified MLflow run.\n%python\r\n\r\nimport mlflow\r\nimport os\r\nfrom mlflow.tracking import MlflowClient\r\nclient = MlflowClient()\r\nlocal_dir = \"<local_dir>\"\r\nif not os.path.exists(local_dir):\r\n  os.mkdir(local_dir)\r\n\r\n# Creating sample artifact \"features.txt\".\r\nfeatures = \"rooms, zipcode, median_price, school_rating, transport\"\r\nwith open(\"features.txt\", 'w') as f:\r\n    f.write(features)\r\n\r\n# Creating sample MLflow run & logging artifact \"features.txt\" to the MLflow run.\r\nwith mlflow.start_run() as run:\r\n    mlflow.log_artifact(\"features.txt\", artifact_path=\"features\")\r\n\r\n# Download the artifact to local storage.\r\nlocal_path = client.download_artifacts(<run_id>, \"features\", local_dir)\r\nprint(\"Artifacts downloaded in: {}\".format(local_dir))\r\nprint(\"Artifacts: {}\".format(local_dir))\nAfter the artifacts have been downloaded to local storage, you can copy (or move) them to an external filesystem or a mount point using standard tools.\nCopy to an external filesystem\n%scala\r\n\r\ndbutils.fs.cp(local_dir, \"\")\nMove to a mount point\n%python\r\n\r\nimport shutil\r\n\r\nshutil.move(local_dir, \"/dbfs/mnt/\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-legacy-storage.json b/scraped_kb_articles/mlflow-artifacts-legacy-storage.json new file mode 100644 index 0000000000000000000000000000000000000000..54caa02098245b4ca8ab5af784543a7d9eabd713 --- /dev/null +++ 
b/scraped_kb_articles/mlflow-artifacts-legacy-storage.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-legacy-storage", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA new icon appears on the MLflow\nExperiments\npage with the following open access warning:\nCause\nMLflow experiment permissions (\nAWS\n|\nAzure\n|\nGCP\n) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files.\nIn MLflow 1.11 and above, new experiments store artifacts in an MLflow-managed location (\ndbfs:/databricks/mlflow-tracking/\n) that enforces experiment access controls.\nCertain older experiments use a legacy storage location (\ndbfs:/databricks/mlflow/\n) that can be accessed by all users of your workspace.\nThis warning indicates that your experiment uses a legacy artifact storage location.\nSolution\nYou should always use the MLflow-managed DBFS storage locations when logging artifacts to experiments. This protects against unintended or unauthorized access to your MLflow artifacts." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-no-client-error.json b/scraped_kb_articles/mlflow-artifacts-no-client-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1324b252ce05e11fe14c4904b669bb6b86a5f306 --- /dev/null +++ b/scraped_kb_articles/mlflow-artifacts-no-client-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-no-client-error", + "title": "Título do Artigo Desconhecido", + "content": "MLflow experiment permissions (\nAWS\n|\nAzure\n) are now enforced on artifacts in MLflow Tracking, enabling you to easily control access to your datasets, models, and other files.\nInvalid mount exception\nProblem\nWhen trying to access an MLflow run artifact using Databricks File System (DBFS) commands, such as\ndbutils.fs\n, you get the following error:\ncom.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /databricks/mlflow-tracking///artifacts for resolving path '///artifacts' within mount at '/databricks/mlflow-tracking'.\nCause\nWith the extension of MLflow experiment permissions to artifacts, DBFS access APIs for run artifacts stored in\ndbfs:/databricks/mlflow-tracking/\nare no longer supported.\nSolution\nUpgrade to MLflow client version 1.9.1 or above to download, list, or upload artifacts stored in\ndbfs:/databricks/mlflow-tracking/\n.\n%sh\r\n\r\npip install --upgrade mlflow\nFileNotFoundError\nProblem\nWhen trying to access an MLflow run artifact using\n%sh\n/\nos.listdir()\n, you get the following error:\nFileNotFoundError: [Errno 2] No such file or directory: '/databricks/mlflow-tracking/'\nCause\nWith the extension of MLflow experiment permissions to artifacts, run artifacts stored in\ndbfs:/databricks/mlflow-tracking/\ncan only be accessed using MLflow client version 1.9.1 or above.\nSolution\nUpgrade to MLflow client version 1.9.1 or above to download, list, or upload artifacts stored 
in\ndbfs:/databricks/mlflow-tracking/\n.\n%sh\r\n\r\npip install --upgrade mlflow" +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-outdated-client.json b/scraped_kb_articles/mlflow-artifacts-outdated-client.json new file mode 100644 index 0000000000000000000000000000000000000000..aee30de80c848a96469679a95fc4f7d0246e1132 --- /dev/null +++ b/scraped_kb_articles/mlflow-artifacts-outdated-client.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-outdated-client", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou get an\nOSError: No such file or directory\nerror message when trying to download or log artifacts using one of the following:\nMlflowClient.download_artifacts()\nmlflow.[flavor].log_model()\nmlflow.[flavor].load_model()\nmlflow.log_artifacts()\nOSError: No such file or directory: '/dbfs/databricks/mlflow-tracking///artifacts/...'\nCause\nYour MLflow client is out of date.\nOlder versions of MLflow do not provide support for artifacts stored in\ndbfs:/databricks/mlflow-tracking/\n.\nSolution\nUpgrade to MLflow version 1.9.1 or higher and try again.\n%sh\r\n\r\npip install --upgrade mlflow" +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-artifacts-permission-denied.json b/scraped_kb_articles/mlflow-artifacts-permission-denied.json new file mode 100644 index 0000000000000000000000000000000000000000..05cf6086471f2f94e0dddc51b2ef8f0978be0aac --- /dev/null +++ b/scraped_kb_articles/mlflow-artifacts-permission-denied.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-artifacts-permission-denied", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou get a\nPERMISSION_DENIED\nerror when trying to access an MLflow artifact using the MLflow client.\nRestException: PERMISSION_DENIED: User does not have permission to 'View' experiment with id \nor\nRestException: PERMISSION_DENIED: User does 
not have permission to 'Edit' experiment with id \nCause\nWith the extension of MLflow experiment permissions to artifacts, you must have explicit permission to access artifacts of an MLflow experiment.\nThe error suggests that you do not have permission to access artifacts of the experiment.\nSolution\nAsk the experiment owner to give you the appropriate level of permissions to access the experiment.\nExperiment permissions (\nAWS\n|\nAzure\n|\nGCP\n) automatically apply to artifacts of an experiment." +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-error-invalid_parameter_value-during-model-training-and-logging-process.json b/scraped_kb_articles/mlflow-error-invalid_parameter_value-during-model-training-and-logging-process.json new file mode 100644 index 0000000000000000000000000000000000000000..b04ff28f5284c262d3950da89bb0d757e493931a --- /dev/null +++ b/scraped_kb_articles/mlflow-error-invalid_parameter_value-during-model-training-and-logging-process.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-error-invalid_parameter_value-during-model-training-and-logging-process", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDuring model training and the logging process using the MLflow Tracking API, you encounter an error at runtime.\nMlflowException('INVALID_PARAMETER_VALUE: This API assumes that the run XXXXXXXX is active. Its current lifecycleStage is deleted\r\nThe cause of this error is typically due to repeated calls to an individual run_id event logging.\nCause\nYou’re repeatedly logging the same parameters in the same run. MLflow's design expects each run to maintain a unique parameter set. Logging the same parameter multiple times in the same session creates conflicts within the tracking store.\nSolution\nEnsure you log each parameter only once per run.\nIf multiple parameter loggings are necessary, use nested runs to separate parameter logs effectively. 
Here is an example implementation where the first parameter logging in a nested run has a\nheight\nof\n10\nand the second parameter logging in a different nested run has a\nheight\nof\n50\n.\nwith mlflow.start_run():\r\n    # First parameter logging in a nested run\r\n    with mlflow.start_run(nested=True):\r\n        mlflow.log_param(\"height\", 10)\r\n    # Second parameter logging in another nested run\r\n    with mlflow.start_run(nested=True):\r\n        mlflow.log_param(\"height\", 50)\nThis approach avoids parameter key (name or identifier) collisions by isolating each logging instance within its own nested run context." +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-exception-error-when-trying-to-migrate-models-from-workspace-model-registry.json b/scraped_kb_articles/mlflow-exception-error-when-trying-to-migrate-models-from-workspace-model-registry.json new file mode 100644 index 0000000000000000000000000000000000000000..6ae531e838d977352299603635669498b7710621 --- /dev/null +++ b/scraped_kb_articles/mlflow-exception-error-when-trying-to-migrate-models-from-workspace-model-registry.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-exception-error-when-trying-to-migrate-models-from-workspace-model-registry", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to migrate models in your Workspace Model Registry to a Model Registry in Unity Catalog, you receive an error.\nroot/.ipykernel/1712/command-5345605615358575-1459565655:51: FutureWarning: ``mlflow.tracking.client.MlflowClient.get_latest_versions`` is deprecated since 2.9.0. Model registry stages will be removed in a future major release. 
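The per-run parameter conflict described in the cause can be illustrated with a toy tracking store. This is a hypothetical sketch, not MLflow's actual implementation; it only mimics the constraint that a parameter key may hold exactly one value within a run.

```python
# Toy illustration of the per-run parameter uniqueness rule (hypothetical,
# NOT MLflow's real tracking store).
class ToyRun:
    def __init__(self, run_id):
        self.run_id = run_id
        self._params = {}

    def log_param(self, key, value):
        # Re-logging the same value is harmless; a *different* value conflicts.
        if key in self._params and self._params[key] != value:
            raise ValueError(
                f"INVALID_PARAMETER_VALUE: param '{key}' already logged "
                f"with value {self._params[key]!r} in run {self.run_id}"
            )
        self._params[key] = value

run = ToyRun("abc123")
run.log_param("height", 10)      # fine
run.log_param("height", 10)      # idempotent re-log of the same value is fine
try:
    run.log_param("height", 50)  # conflicting value -> rejected
except ValueError as e:
    print(e)
```

Nested runs sidestep this because each nested run owns its own parameter dictionary, so the same key can carry different values across runs.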
To learn more about the deprecation of model registry stages, see our migration guide here: https://mlflow.org/docs/latest/model-registry.html#migrating-from-stages\r\n  model_versions = client.get_latest_versions(model_name)\r\n\r\nMlflowException: Method 'get_latest_versions' is unsupported for models in the Unity Catalog. To load the latest version of a model in Unity Catalog, you can set an alias on the model version and load it by alias. See https://mlflow.org/docs/latest/model-registry.html#deploy-and-organize-models-with-aliases-and-tags for details.\r\nFile , line 11\r\n      8 model_name = \"\"\r\n     10 # Get the latest version (or specify a version)\r\n---> 11 model_versions = client.get_latest_versions(model_name)\r\n     13 # Print details of all versions\r\n     14 for version_info in model_versions:\nCause\nThe default registry URI is set to Unity Catalog. The MLflow client is attempting to access the default URI instead of the Workspace Model Registry.\nSolution\nSet the registry URI to the Workspace Model Registry before running the MLflow operations. 
Add the following line to your code to set the registry URI to your Workspace Model Registry.\n```python\r\nmlflow.set_registry_uri(\"databricks\")\r\n```" +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-fail-access-hive.json b/scraped_kb_articles/mlflow-fail-access-hive.json new file mode 100644 index 0000000000000000000000000000000000000000..db7a6eba14800d71d0c8a7d09f999c23cd118020 --- /dev/null +++ b/scraped_kb_articles/mlflow-fail-access-hive.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-fail-access-hive", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an MLflow project that fails to access a Hive table and returns a\nTable or view not found\nerror.\npyspark.sql.utils.AnalysisException: \"Table or view not found: `default`.`tab1`; line 1 pos 21;\\n'Aggregate [unresolvedalias(count(1), None)]\\n+- 'UnresolvedRelation `default`.`tab1`\\n\"\r\nxxxxx ERROR mlflow.cli: === Run (ID 'xxxxx') failed ===\nCause\nThis happens when the\nSparkSession\nobject is created inside the MLflow project without Hive support.\nSolution\nConfigure\nSparkSession\nwith the\n.enableHiveSupport()\noption in the session builder. Do this as part of your MLflow project.\n%scala\r\n\r\nval spark = SparkSession.builder.enableHiveSupport().getOrCreate()" +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-invalid-access-token-error.json b/scraped_kb_articles/mlflow-invalid-access-token-error.json new file mode 100644 index 0000000000000000000000000000000000000000..95e4263ae9d0c666962c32c8ede0e4a52a1cb11e --- /dev/null +++ b/scraped_kb_articles/mlflow-invalid-access-token-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-invalid-access-token-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have long-running MLflow tasks in your notebook or job and the tasks are not completed. 
Instead, they return a\n(403) Invalid access token\nerror message.\nError stack trace: MlflowException: API request to endpoint /api/2.0/mlflow/runs/create failed with error code\n403 != 200. Response body: '\n\n\nError 403 Invalid access token.\n\n

HTTP ERROR 403

Problem accessing /api/2.0/mlflow/runs/create. Reason:

    Invalid access token.

\n\n\nCause\nThe Databricks access token that the MLflow Python client uses to communicate with the tracking server expires after several hours. If your ML tasks run for an extended period of time, the access token may expire before the task completes. This results in MLflow calls failing with a\n(403) Invalid access token\nerror message in both notebooks and jobs.\nSolution\nYou can work around this issue by manually creating an access token with an extended lifetime and then configuring that access token in your notebook prior to running MLflow tasks.\nGenerate a personal access token (\nAWS\n|\nAzure\n) and configure it with an extended lifetime.\nSet up the Databricks CLI (\nAWS\n|\nAzure\n).\nUse the Databricks CLI to create a new secret with the personal access token you just created.\ndatabricks secrets put --scope {} --key mlflow-access-token --string-value {}\nInsert this sample code at the beginning of your notebook. Include your secret name and your Workspace URL (\nAWS\n|\nAzure\n).\n%python\r\n\r\naccess_token = dbutils.secrets.get(scope=\"{}\", key=\"mlflow-access-token\")\r\n\r\nimport os\r\nos.environ[\"DATABRICKS_TOKEN\"] = access_token\r\nos.environ[\"DATABRICKS_HOST\"] = \"https://\"\r\n\r\nfrom databricks_cli.configure import provider\r\nconfig_provider = provider.EnvironmentVariableConfigProvider()\r\nprovider.set_config_provider(config_provider)\nRun your notebook or job as normal." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/mlflow-models-saving-to-local-path-when-registered-instead-of-to-the-desired-registry.json b/scraped_kb_articles/mlflow-models-saving-to-local-path-when-registered-instead-of-to-the-desired-registry.json new file mode 100644 index 0000000000000000000000000000000000000000..570920acfddbc9f14b5ea486a83ba2dd1fe0bcc7 --- /dev/null +++ b/scraped_kb_articles/mlflow-models-saving-to-local-path-when-registered-instead-of-to-the-desired-registry.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflow-models-saving-to-local-path-when-registered-instead-of-to-the-desired-registry", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working in a notebook on a cluster configured to use a Docker container, you register your models using MLflow and notice they save to a local path (such as a workspace path).\nHowever, you want the model to save to\nMachine Learning > Models\n, as shown in the following image.\nCause\nYour Docker container-configured cluster’s tracking URI is set to the workspace location.\nSolution\nDatabricks recommends managing your model lifecycle in Unity Catalog. 
For more information, refer to the\nManage model lifecycle in Unity Catalog\n(\nAWS\n|\nAzure\n) documentation.\nTo set a model-tracking URI to Unity Catalog, you can use the following code.\nCode for testing\nimport os\r\nimport mlflow\r\ndb_host = \"\"\r\ndb_token = \"\"\r\nmlflow.set_tracking_uri('databricks-uc')\r\nos.environ[\"DATABRICKS_HOST\"] = \r\nos.environ[\"DATABRICKS_TOKEN\"] = \nCode for production\nIn production, store the PAT token in a secret and use\ndbutils\nto get the secret.\nimport os\r\nimport mlflow\r\ndb_host = \"\"\r\ndb_token = dbutils.secrets.get(\"\", \"\")\r\nmlflow.set_tracking_uri('databricks-uc')\r\nos.environ[\"DATABRICKS_HOST\"] = \r\nos.environ[\"DATABRICKS_TOKEN\"] = \nAlternatively, you can manage your model lifecycle using the Workspace Model Registry. For more information, refer to the\nManage model lifecycle using the Workspace Model Registry (legacy)\n(\nAWS\n|\nAzure\n) documentation.\nUse the same code to set a model-tracking URI to Databricks, but change the line\nmlflow.set_tracking_uri('databricks-uc')\nto\nmlflow.set_tracking_uri('databricks')\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/mlflowclientsearch_runs-returns-only-a-subset-of-runs.json b/scraped_kb_articles/mlflowclientsearch_runs-returns-only-a-subset-of-runs.json new file mode 100644 index 0000000000000000000000000000000000000000..878553bc881d41e25ec266249de4776a1e0b70b4 --- /dev/null +++ b/scraped_kb_articles/mlflowclientsearch_runs-returns-only-a-subset-of-runs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/mlflowclientsearch_runs-returns-only-a-subset-of-runs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe\nMlflowClient.search_runs()\nfunction in MLflow searches for runs that match a given set of criteria within specified experiments. You can use this to retrieve and order runs based on specific metrics, parameters, or tags. 
When you try to use\nMlflowClient.search_runs()\nin your current notebook, it only returns a subset of the runs instead of the actual number of runs.\nFor example, an experiment with more than 300 runs may only return 17 rows.\nExample code\n%python\r\n\r\nimport mlflow\r\nfrom mlflow.tracking.client import MlflowClient\r\n\r\n# Use the experiment notebook path to get the experiment ID\r\nexperiment_name = ''\r\nClient = MlflowClient()\r\nexperiment_id= client.get_experiment_by_name(experiment_name).experiment_id\r\nruns_count = len(client.search_runs(experiment_ids=experiment_id))\nWhen you read the value of\nruns_count\nit is less than the actual number of runs in the experiment.\nThe example experiment UI only shows 17 runs.\nCause\nThe\nMlflowClient.search_runs()\nmethod is based on pagination, which means it might not return all runs in a single call but rather the runs present on a single page of an experiment.\nSolution\nTo ensure an accurate run count, you can use one of two approaches.\nUse\nmlflow.search_runs()\nUse\nmlflow.search_runs()\ninstead of\nMlflowClient.search_runs()\n.\nFor this approach, you must first use\nMlflowClient\nto get the\nexperiment_id\nthrough the notebook path associated with the experiment.\nAfter you have the\nexperiment_id\nyou can use it with the\nmlflow.search_runs()\nfunction which returns the total number of runs as expected.\nExample code\nReplace\n\nwith the full path to the notebook associated with the experiment before running this example code.\n%python\r\n\r\nimport mlflow\r\nfrom mlflow.tracking.client import MlflowClient\r\n\r\n# Use the experiment notebook path to get the experiment ID\r\nexperiment_name = ''\r\nClient = MlflowClient()\r\nexperiment_id= client.get_experiment_by_name(experiment_name).experiment_id\r\n\r\nruns = mlflow.search_runs(experiment_ids=experiment_id)\r\nruns_count = len(runs)\nInclude extra parameters when calling\nMlflowClient.search_runs()\nTo handle pagination correctly when using 
the\nsearch_runs\nmethod in the\nMlflowClient\n, you need to ensure that you are iterating through all the pages of results.\nInitialize an empty list\nruns\nto store the results and set\npage_token\nto\nNone\n.\nUse a\nwhile\nloop to continuously call\nsearch_runs()\nuntil all pages are retrieved. In each iteration, call\nsearch_runs\nwith the necessary parameters:\nexperiment_ids\n- List of experiment IDs to search within.\n\n- Maximum number of results to return per page (e.g., 100).\npage_token\n- Token for the next page of results (initially None).\nExtend the\nruns\nlist with the results from the current page. If\nresult.token\nis\nNone\n, it means there are no more pages to retrieve, and the loop can be exited. Otherwise, set\npage_token\nto\nresult.token\nto retrieve the next page in the subsequent iteration.\nExample code\nReplace\n\nwith the full path to the notebook associated with the experiment and set the\n\nvalue before running this example code.\n%python\r\n\r\nimport mlflow\r\nfrom mlflow.tracking.client import MlflowClient\r\n\r\n# Use the experiment notebook path to get the experiment ID\r\nexperiment_name = ''\r\nClient = MlflowClient()\r\nexperiment_id= client.get_experiment_by_name(experiment_name).experiment_id\r\n\r\npage_token = None\r\nruns = []\r\nwhile True:\r\n    result = client.search_runs(\r\n        experiment_ids=[experiment_id],\r\n        max_results=,\r\n        page_token=page_token\r\n    )\r\n    runs.extend(result)\r\n    if not result.token:\r\n        break\r\n    page_token = result.token\r\nprint(len(runs))" +} \ No newline at end of file diff --git a/scraped_kb_articles/model-lineage-not-showing-source-delta-tables-in-the-graph-for-databricks-runtime-15-3-or-above.json b/scraped_kb_articles/model-lineage-not-showing-source-delta-tables-in-the-graph-for-databricks-runtime-15-3-or-above.json new file mode 100644 index 0000000000000000000000000000000000000000..0849cfcb3da66f64c0dfcd4a722b7bfaf1db690c --- /dev/null +++ 
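The token-driven pagination loop above can be sketched generically. In this self-contained sketch, `fake_search` is a hypothetical stand-in for `MlflowClient.search_runs`, and `Page.token` plays the role of the paged result's continuation token:

```python
# Generic token-driven pagination, mirroring the while-loop described above.
# `fake_search` is a hypothetical stand-in for MlflowClient.search_runs.
class Page(list):
    def __init__(self, items, token):
        super().__init__(items)
        self.token = token  # None means there are no more pages

def fake_search(max_results, page_token=None):
    data = list(range(10))            # pretend the experiment has 10 runs
    start = int(page_token or 0)
    chunk = data[start:start + max_results]
    next_start = start + max_results
    return Page(chunk, str(next_start) if next_start < len(data) else None)

def fetch_all_runs(search, page_size=3):
    runs, token = [], None
    while True:
        page = search(max_results=page_size, page_token=token)
        runs.extend(page)
        if not page.token:
            break
        token = page.token
    return runs

print(len(fetch_all_runs(fake_search)))  # 10, not just the first page of 3
```

Stopping after the first call is exactly the bug in the problem statement: only one page (e.g., 17 rows of 300+) is ever read.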
b/scraped_kb_articles/model-lineage-not-showing-source-delta-tables-in-the-graph-for-databricks-runtime-15-3-or-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/model-lineage-not-showing-source-delta-tables-in-the-graph-for-databricks-runtime-15-3-or-above", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen training and registering a model using Delta tables in Unity Catalog (UC) you can see the lineage graph, but not the source Delta tables used to create it.\nCause\nSupport for table-to-model lineage is available from MLflow 2.11.0 and above, which is available as part of Databricks Runtime 15.3 and above.\nSolution\nIf you’re using Databricks Runtime 15.3 or above, to view the Delta tables in UC used to make the lineage graph, first load them using the following code.\ntrain_spark = mlflow.data.load_delta(table_name=)\r\ntest_spark = mlflow.data.load_delta(table_name=)\nThen, convert the tables to Pandas so the core model can take the Spark DataFrames as inputs. Create\nX_train\n,\nX_test\n,\ny_train\nand\ny_test\nusing the following code.\nX_train = train_spark.df.toPandas().drop([“”], axis=1)\r\nX_test = test_spark.df.toPandas().drop([“”], axis=1)\r\ny_train = train_spark.df.select(“”).toPandas()\r\ny_test = test_spark.df.select(“”).toPandas()\nFinally, when starting the MLflow run, log the input.\nwith mlflow.start_run(run_name='untuned_random_forest'):\r\n…\r\nmodel.fit(X_train_spark, y_train_spark)\r\nmlflow.log_input(train_spark, \"training\")\r\nmlflow.log_input(test_spark,\"test\")\r\n...\nIf you do not want to use Databricks Runtime 15.3 or above, first install MLfLow version 2.11.0 manually, then follow the steps in the previous part of the solution." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/model-serving-endpoint-creation-fails-with-badrequest-error.json b/scraped_kb_articles/model-serving-endpoint-creation-fails-with-badrequest-error.json new file mode 100644 index 0000000000000000000000000000000000000000..fb0401e1ea2da8d65eb94e9ccdfd6e7274da754b --- /dev/null +++ b/scraped_kb_articles/model-serving-endpoint-creation-fails-with-badrequest-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/model-serving-endpoint-creation-fails-with-badrequest-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile trying to create a model-serving endpoint you encounter an error message.\n`BadRequest: Cannot create 2+ served entities with the same name under the same endpoint config version. You will need to specify served entity names if the given or default names ((model_name)-(model_version)) collide.`\nCause\nWhen the system validates the served model names (composed of catalog name and schema name), it first truncates them to the maximum length limit of 64 characters. Although two names may appear different, they end up treated as identical during validation.\nFor example, if a served model is created with the name\n\"yourname123456789\"\nand another served model is attempted with the name\n\"yourname123456789XYZ\"\nthe system throws the error because the\nXYZ\nin the second name is truncated.\nSolution\nReduce the catalog name length to shorten the overall endpoint name.\nAlternatively, pass\n`endpoint_name`\nas a\nkwarg\nto\n`agents.deploy`\n.\nagents.deploy(, , endpoint_name=\"\")\nFor more information, refer to the\nAgent Framework\nAPI documentation and the\nDeploy an agent for generative AI application\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
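The truncate-then-compare behavior described in the BadRequest cause can be sketched as follows. The 64-character limit comes from the article; the helper itself is hypothetical, not part of the serving API:

```python
MAX_SERVED_NAME_LEN = 64  # limit described in the article

def effective_served_name(name):
    # Hypothetical sketch: names are truncated to the limit before uniqueness
    # validation, so two distinct long names can collide after truncation.
    return name[:MAX_SERVED_NAME_LEN]

a = "yourcatalog.yourschema.yourname" + "x" * 40         # longer than 64 chars
b = "yourcatalog.yourschema.yourname" + "x" * 40 + "XYZ" # differs only in the tail
print(a != b)                                                  # True
print(effective_served_name(a) == effective_served_name(b))    # True -> collision
```

This is why shortening the catalog/schema names, or passing an explicit `endpoint_name`, resolves the error: the distinguishing characters land within the first 64.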
+} \ No newline at end of file diff --git a/scraped_kb_articles/model-serving-endpoint-creation-succeeds-but-deployment-fails-and-error-stack-trace-has-message-_array_api-not-found.json b/scraped_kb_articles/model-serving-endpoint-creation-succeeds-but-deployment-fails-and-error-stack-trace-has-message-_array_api-not-found.json new file mode 100644 index 0000000000000000000000000000000000000000..ff779a84635dd526eae2f8f6419d16429714481c --- /dev/null +++ b/scraped_kb_articles/model-serving-endpoint-creation-succeeds-but-deployment-fails-and-error-stack-trace-has-message-_array_api-not-found.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/model-serving-endpoint-creation-succeeds-but-deployment-fails-and-error-stack-trace-has-message-_array_api-not-found", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile creating a model serving endpoint, the container image creation finishes successfully but fails to deploy.\nThe endpoint page suggests checking the service logs for more information. When you review the error stack trace you see at the end of the trace\n_ARRAY_API\nnot found.\nStack trace example - truncated\nTraceback (most recent call last):  File \"/opt/conda/envs/mlflow-env/bin/gunicorn\", line 8, in \r\nsys.exit(run())\r\n…\r\nFile \"/opt/conda/envs/mlflow-env/lib/python3.10/site-packages/pyarrow/__init__.py\", line 65, in \r\nimport pyarrow.lib as _lib\r\nAttributeError\r\n:\r\n_ARRAY_API not found\nCause\nYou’re trying to run a module in one version of NumPy when it was compiled in a previous version. This happens when you log the MLflow model without specifying NumPy as a dependency, or specifying a version range.\nSolution\nRegister a new version of your model. 
When you re-run the source code (notebook or job) that contains the\nmlflow..log_model\nfunction, include NumPy as a pip dependency and specify the version range to install, such as\nnumpy<2\n.\nExample\nmlflow..log_model(\r\n   modelObj,\r\n   artifact_path = model_artifact_path,\r\n   pip_requirements = [\r\n   \"...\", # Your other packages with specific pinned versions\r\n   \"numpy<2\" # Requirement to install a numpy version lower than 2\r\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/modulenotfounderror-no-module-named-packaging-when-creating-gpu-model-serving-endpoint.json b/scraped_kb_articles/modulenotfounderror-no-module-named-packaging-when-creating-gpu-model-serving-endpoint.json new file mode 100644 index 0000000000000000000000000000000000000000..8ad2a8aadb8752b70fd4093cc83a312c6737dc4f --- /dev/null +++ b/scraped_kb_articles/modulenotfounderror-no-module-named-packaging-when-creating-gpu-model-serving-endpoint.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/modulenotfounderror-no-module-named-packaging-when-creating-gpu-model-serving-endpoint", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour container build process fails during model serving endpoint creation after logging a machine learning model using MLflow. You have run the logging and registering code on a Databricks GPU cluster.\nThe following error appears in the Model Serving endpoint's Build Logs, stating that the ‘packaging’ module is missing.\nInstalling pip dependencies: ...working... 
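The NumPy pin can be checked mechanically before re-logging. This helper is hypothetical (not part of MLflow); it only verifies that a `pip_requirements` list contains a `numpy<2` constraint, which avoids the `_ARRAY_API not found` mismatch described above:

```python
# Hypothetical helper: confirm a pip_requirements list pins numpy below 2.
def numpy_is_pinned_below_2(pip_requirements):
    for req in pip_requirements:
        normalized = req.replace(" ", "").lower()
        if normalized.startswith("numpy<2"):
            return True
    return False

pip_requirements = [
    "pandas==2.0.3",   # example of another pinned dependency (illustrative)
    "numpy<2",         # keep numpy on the 1.x series the model was built against
]
print(numpy_is_pinned_below_2(pip_requirements))  # True
```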
Pip subprocess error:\r\n#20 82.39 error: subprocess-exited-with-error\r\n#20 82.39 × python setup.py egg_info did not run successfully.\r\n#20 82.39 │ exit code: 1\r\n#20 82.39 ╰─> [6 lines of output]\r\n#20 82.39 Traceback (most recent call last):\r\n#20 82.39 File \"\", line 2, in \r\n#20 82.39 File \"\", line 34, in \r\n#20 82.39 File \"/tmp/pip-install-0d8nbi6j/flash-attn_729e3168ad4e43fcbebe75f2aa40d649/setup.py\", line 9, in \r\n#20 82.39 from packaging.version import parse, Version\r\n#20 82.39 ModuleNotFoundError: No module named 'packaging'\r\n#20 82.39 [end of output]\nCause\nWhen logging your model using\nmlflow..log_model\nwithout declaring the\npip_requirements\nparameter (which contains the model's dependencies and their versions in a list of strings format), MLflow defaults to inferring the model's dependencies based on the current notebook session and persists them in the\nrequirements.txt\nfile.\nIn this file, each dependency is listed in alphabetical order. However, this causes an issue because\nflash-attn\ncomes before the packaging module, but\nflash-attn\nrequires the packaging module to be installed first.\nSince the model's dependencies are installed during the container build process in the alphabetical order specified in\nrequirements.txt\n, the packaging module is considered missing during the\nflash-attn\ninstallation.\nSolution\nRe-log your model in the same notebook/source file and explicitly include the\npip_requirements\nparameter with the packaging module ordered first.\nGet the notebook/source file used to log your model ready.\nNavigate to the\nExperiment\npage of your model and then to the run that produced the model you want to serve.\nWithin the\nRun\npage, click on the\nArtifacts\ntab and then on the\nrequirements.txt\nfile.\nSelect all the content of this file and copy it.\nCreate a list of strings from the copied content of the\nrequirements.txt\nfile, where each string represents one dependency package required by your 
model.\nReorder the ‘packaging’ dependency to position it before the\nflash-attn\npackage. You can also simply delete\nflash-attn\nif you do not need it.\nIn the notebook/source file used to log your model, add the\npip_requirements\nparameter to the\nmlflow..log_model\nfunction, setting it to the list of strings you manipulated in the previous step.\nRerun your source code to re-log and register a new version of your model.\nAfter the new model version is created, proceed to serve your model using Model Serving.\nIf you use transformer-based models, Databricks also recommends following an optimized-mpt-serving approach, which is to include metadata when logging the model:\nmetadata = {\"task\": \"llm/v1/completions\"}\nFor more information, please review the\nOptimized large language model (LLM) serving\n(\nAWS\n|\nAzure\n) documentation.\nRedeploy the model using metadata information as a new version and serve the model using the API.\nIf the above steps do not resolve the issue, please specify\nCUDA_HOME\nin the Dockerfile in order to support\nflash-attn\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/multi-task-workflows-using-incorrect-parameter-values.json b/scraped_kb_articles/multi-task-workflows-using-incorrect-parameter-values.json new file mode 100644 index 0000000000000000000000000000000000000000..6a821920a77a6ef08560ef2d5a28400123a44490 --- /dev/null +++ b/scraped_kb_articles/multi-task-workflows-using-incorrect-parameter-values.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/multi-task-workflows-using-incorrect-parameter-values", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nUsing key-value parameters in a multi task workflow is a common use case. It is normal to have multiple tasks running in parallel and each task can have different parameter values for the same key. 
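The reordering step above can be automated. The following helper is hypothetical (you can just as easily edit the list by hand); it moves `packaging` ahead of `flash-attn` in a requirements list so the build installs them in a working order:

```python
def ensure_packaging_first(reqs):
    """Return reqs with 'packaging' ordered before 'flash-attn' (hypothetical helper)."""
    def base(req):
        # Crude package-name extraction: strip any version specifier.
        for sep in ("==", ">=", "<=", "~=", "<", ">"):
            req = req.split(sep)[0]
        return req.strip().lower()

    names = [base(r) for r in reqs]
    if "packaging" in names and "flash-attn" in names:
        p, f = names.index("packaging"), names.index("flash-attn")
        if p > f:  # packaging currently installs after flash-attn
            reqs = list(reqs)
            reqs.insert(f, reqs.pop(p))
    return reqs

print(ensure_packaging_first(["flash-attn==2.5.0", "numpy<2", "packaging==23.2"]))
# ['packaging==23.2', 'flash-attn==2.5.0', 'numpy<2']
```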
These key-value parameters are read within the code and used by each task.\nFor example, assume you have four tasks:\ntask1\n,\ntask2\n,\ntask3\n, and\ntask4\nwithin a workflow job.\ntable-name\nis the parameter key and the parameter values are\nemployee\n,\ndepartment\n,\nlocation\n, and\ncontacts\n.\nWhen you run the job, you expect each task to get its own parameters. However if the application code uses Scala companion objects, you may notice one of the task parameters gets applied to all other tasks, instead of the respective parameters for each task getting applied. This produces inconsistent results.\nUsing our example, if the tasks are run in parallel using Scala companion objects, any one task parameter (for example,\ntask4\nparameter\ncontacts\n) may get passed as the table name to the other three tasks.\nCause\nWhen companion objects are used within application code, there is a mutable state in the companion object that is modified concurrently. Since all tasks run on the same cluster, this class is loaded once and all tasks run under the same Java virtual machine (JVM).\nSolution\nYou can mitigate the issue by applying one of these solutions. The best choice depends on your specific use case.\nRun the jobs sequentially (add dependencies in tasks).\nSchedule each task on a different cluster.\nRewrite the code that loads the configuration so you are explicitly creating a new object and not using the companion object's shared state." 
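The shared-state hazard applies to any tasks sharing one process, not only Scala companion objects on a shared JVM. A purely illustrative Python analogue (the article's scenario is Scala) contrasts class-level state, which every task overwrites, with per-task instances:

```python
import threading

class SharedConfig:
    # Class-level attribute: ONE slot shared by every task in the process,
    # analogous to a mutable field on a Scala companion object under one JVM.
    table_name = None

def task_with_shared_state(name, results):
    SharedConfig.table_name = name           # every task overwrites the same slot
    # ... other work interleaves here, letting another task clobber the value ...
    results.append(SharedConfig.table_name)  # may read a different task's value

results = []
threads = [threading.Thread(target=task_with_shared_state, args=(n, results))
           for n in ("employee", "department", "location", "contacts")]
for t in threads: t.start()
for t in threads: t.join()
# `results` can contain duplicates: tasks read whichever value was written last.

class TaskConfig:
    # Per-task instance: each task owns its configuration, so no clobbering.
    def __init__(self, table_name):
        self.table_name = table_name

safe = [TaskConfig(n).table_name
        for n in ("employee", "department", "location", "contacts")]
print(safe)  # each task keeps its own value
```

This mirrors the third mitigation: explicitly creating a new configuration object per task instead of reading the companion object's shared state.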
+} \ No newline at end of file diff --git a/scraped_kb_articles/multiple-executors-single-worker.json b/scraped_kb_articles/multiple-executors-single-worker.json new file mode 100644 index 0000000000000000000000000000000000000000..ef6911cff62cc7569f13365b8eee7cc0417e46db --- /dev/null +++ b/scraped_kb_articles/multiple-executors-single-worker.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/multiple-executors-single-worker", + "title": "Título do Artigo Desconhecido", + "content": "When you create a cluster, Databricks launches one Apache Spark executor instance per worker node, and the executor uses all of the cores on the node. In certain situations, such as if you want to run non-thread-safe JNI libraries, you might need an executor that has only one core or task slot, and does not attempt to run concurrent tasks. In this case, multiple executor instances run on a single worker node, and each executor has only one core.\nIf you run multiple executors, you increase the JVM overhead and decrease the overall memory available for processing.\nTo start single-core executors on a worker node, configure two properties in the\nSpark Config\n:\nspark.executor.cores\nspark.executor.memory\nThe property\nspark.executor.cores\nspecifies the number of cores per executor. Set this property to\n1\n.\nThe property\nspark.executor.memory\nspecifies the amount of memory to allot to each executor. This must be set high enough for the executors to properly function, but low enough to allow for all cores to be used.\nNote\nIf you set a total memory value (memory per executor x number of total cores) that is greater than the memory available on the worker node, some cores will remain unused.\nAWS\nFor example, an\ni3.xlarge\nnode, which has 30.5 GB of memory, shows available memory at 24.9 GB. Choose a value that fits the available memory when multiplied by the number of executors. You may need to set a value that allows for some overhead. 
For example, set\nspark.executor.cores\nto\n1\nand\nspark.executor.memory\nto\n6g\n:\nThe\ni3.xlarge\ninstance type has 4 cores, and so 4 executors are created on the node, each with 6 GB of memory.\nGCP\nFor example, the\nn1-highmem-4\nworker node has 26 GB of total memory, but only has 15.3 GB of available memory once the cluster is running.\nUsing an example\nSpark Config\nvalue, we set the core value to 1 and assign 5 GB of memory to each executor.\nspark.executor.cores 1\r\nspark.executor.memory 5g\nOnce the cluster starts, the worker nodes each have 4 cores, but only 3 are used. There are 3 executors, each with 5 GB of memory on each worker node. This is a total of 15 GB of memory used.\nThe fourth core never spins up, as there is not enough memory to allocate to it.\nYou must balance your choice of instance type with the memory required by each executor in order to maximize the use of every core on your worker nodes." +} \ No newline at end of file diff --git a/scraped_kb_articles/multiple-identical-files-being-written-to-badrecordspath-instead-of-just-one-file-when-writing-code-to-read-a-csv-file-as-a-dataframe.json b/scraped_kb_articles/multiple-identical-files-being-written-to-badrecordspath-instead-of-just-one-file-when-writing-code-to-read-a-csv-file-as-a-dataframe.json new file mode 100644 index 0000000000000000000000000000000000000000..deae2f978de871567c78cf7dae08c44ec0882bbf --- /dev/null +++ b/scraped_kb_articles/multiple-identical-files-being-written-to-badrecordspath-instead-of-just-one-file-when-writing-code-to-read-a-csv-file-as-a-dataframe.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/multiple-identical-files-being-written-to-badrecordspath-instead-of-just-one-file-when-writing-code-to-read-a-csv-file-as-a-dataframe", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the following code to read a CSV file as a DataFrame, you notice multiple identical files being written 
to\nbadRecordsPath\ninstead of just one file and you cannot see information on which malformed records are present in the table.\ndf_reader = (\r\n  spark.read.format(\"com.databricks.spark.csv\") # Specifies the format as CSV using Databricks' CSV reader\r\n  .schema() # Applies a predefined schema to the DataFrame\r\n  .option(\"header\", \"true\" if headers is None else \"false\") # Sets the header option based on the presence of headers\r\n  .option(\"delimiter\", sep) # Specifies the delimiter used in the CSV file\r\n  .option(\"badRecordsPath\", ) # Specifies a path to store records that are malformed or corrupt\r\n)\r\n\r\ndf = df_reader.load() # Loads the CSV file from the specified filepath into a DataFrame\nCause\nAlthough\nbadRecordsPath\nis used to specify a location where records are stored that do not conform to the expected schema or encounter errors during processing,\nbadRecordsPath\ndoes not provide transactional guarantees.\nSolution\nUse the\n.option(\"mode\", \"PERMISSIVE\")\nsetting to configure the CSV reader to handle schema mismatches and malformed records. The following code provides an example.\ndf_reader = (\r\n    spark.read.format(\"csv\")\r\n    .schema()  # Assuming 'schema' is defined in the context\r\n    .option(\"header\", \"false\")\r\n    .option(\"mode\", \"PERMISSIVE\")\r\n    .option(\"rescuedDataColumn\", \"\")  # Specify a different column name\r\n)\r\n\r\ndf = df_reader.load(your-filepath) # Loads the CSV file from the specified filepath into a DataFrame\nFor more information, refer to the\nRead CSV files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAdditional benefits\nUsing\n.option(\"mode\", \"PERMISSIVE\")\nhas two additional benefits.\nAny fields in the CSV file that do not match the predefined schema are captured in a special column called\n_rescued_data\n. This column contains a JSON blob with the mismatched fields and the source file path of the record. 
This allows you to inspect and handle these mismatched fields separately without losing any data.\nThe reader does not drop the entire row if there are schema mismatches. Instead, it inserts null for fields that could not be parsed correctly and captures the mismatched data in the\n_rescued_data column\n. This ensures that you do not lose any records due to schema mismatches, and you can later review and correct the data as needed." +} \ No newline at end of file diff --git a/scraped_kb_articles/multiple-tables-created-in-a-dlt-pipeline-using-for-loop-have-the-same-data-schema-of-the-last-source-table.json b/scraped_kb_articles/multiple-tables-created-in-a-dlt-pipeline-using-for-loop-have-the-same-data-schema-of-the-last-source-table.json new file mode 100644 index 0000000000000000000000000000000000000000..1c61d47c8d4dcab9a6e4cdcee98b76c365f4ee65 --- /dev/null +++ b/scraped_kb_articles/multiple-tables-created-in-a-dlt-pipeline-using-for-loop-have-the-same-data-schema-of-the-last-source-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/multiple-tables-created-in-a-dlt-pipeline-using-for-loop-have-the-same-data-schema-of-the-last-source-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with Delta Live Tables (DLT), you notice all tables created within a for loop in your DLT pipeline load similar data columns. 
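Because `_rescued_data` is a JSON blob, its values can be inspected with ordinary JSON tooling once collected. A minimal sketch with the standard library, using a hypothetical rescued value (the field names and `_file_path` key in the sample are illustrative, per the description above):

```python
import json

# Hypothetical _rescued_data value: a JSON blob holding the fields that failed
# to parse against the schema, plus the source file path of the record.
rescued = '{"c2": "not-an-int", "_file_path": "dbfs:/data/part-00000.csv"}'

blob = json.loads(rescued)
mismatched = {k: v for k, v in blob.items() if k != "_file_path"}
print(mismatched)          # {'c2': 'not-an-int'}
print(blob["_file_path"])  # dbfs:/data/part-00000.csv
```

This makes it straightforward to report which columns mismatched and which source files produced them.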
You observe that your tables have identical schemas, or the DLT graph of all the target tables points to a single source table.\nCause\nWhen creating DLT tables within a for loop using the\n@dlt.table\ndecorator, if the source table or file path is not passed as an argument to the function, the function will reference the last value in the for loop.\nContext\nThe lazy execution model that pipelines use to evaluate Python code requires that your logic directly references individual values when the function decorated by\n@dlt.table()\nis invoked.\nThe for loop evaluates logic in serial order, but once planning is complete for the datasets, the pipeline runs logic in parallel.\nExample of how the issue occurs\nThree tables are created in both the bronze and silver layers by referring to the respective tables in the previous layer. The expected behavior is a one-to-one map of source to table.\nsource_1 → table1\nsource_2 → table2\nsource_3 → table3\nThe DLT notebook code used is the following.\nimport dlt\r\ntables = [\"source_1\", \"source_2\", \"source_3\"]\r\nfor table_name in tables:\r\n   @dlt.table(name=table_name)\r\n   def create_table():\r\n       return spark.read.table(\"..
\")\r\n\r\ntables = [\"table1\", \"table2\", \"table3\"]\r\nlist={\"table1\":\"source_1\",\"table2\":\"source_2\",\"table3\":\"source_3\"}\r\nfor t_name in tables:\r\n @dlt.table(name=t_name)\r\n def create_table():\r\n   return spark.read.table(f\"live.{list[t_name]}\")\nOnce the DLT pipeline is executed, all the tables in the silver layer (table1, table2, and table3) are created by pointing to the same table source_3 from the previous (bronze) layer. The behavior that occurs is a one-to-three map of source to tables, instead of one-to-one.\nsource_1 →\nsource_2 →\nsource_3 → table1, table2, table3\nThe above example does not correctly reference the values. It creates tables with distinct names, but all tables load the data from the last value in the for loop. As a result, all tables will be created with the schema of the last source file/table processed in the loop.\nSolution\n1. In a notebook, modify your code to pass the source table as an argument to the\ncreate_table\nfunction instead.\nimport dlt\r\ntables = [\"source_1\", \"source_2\", \"source_3\"]\r\nfor table_name in tables:\r\n   @dlt.table(name=table_name)\r\n   def create_table():\r\n       return spark.read.table(\"..\")\r\n\r\ntables = [\"table1\", \"table2\", \"table3\"]\r\nlist={\"table1\":\"source_1\",\"table2\":\"source_2\",\"table3\":\"source_3\"}\r\nfor t_name in tables:\r\n @dlt.table(name=t_name)\r\n def create_table(source_table_name=f\"{list[t_name]}\"):\r\n   return spark.read.table(f\"live.{source_table_name}\")\n2. Navigate to your DLT pipeline and execute it.\n3. After the DLT pipeline is executed, verify the DLT graph to make sure the sources are mapping to tables one-to-one." 
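The mechanism behind the fix is standard Python late binding: a function defined inside a loop looks up the loop variable when it is called, not when it is defined, while a default argument is evaluated once at definition time. A self-contained sketch of both behaviors:

```python
# Late binding: each function body reads table_name when CALLED, so all three
# see the loop's final value, mirroring how every DLT table read source_3.
sources = ["source_1", "source_2", "source_3"]

readers = [lambda: table_name for table_name in sources]
print([r() for r in readers])  # ['source_3', 'source_3', 'source_3']

# Fix: bind the current value at definition time via a default argument,
# just like create_table(source_table_name=...) in the solution above.
readers_fixed = [lambda table_name=table_name: table_name for table_name in sources]
print([r() for r in readers_fixed])  # ['source_1', 'source_2', 'source_3']
```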
+} \ No newline at end of file diff --git a/scraped_kb_articles/multiple_xml_data_source-error-while-working-with-xml-data.json b/scraped_kb_articles/multiple_xml_data_source-error-while-working-with-xml-data.json new file mode 100644 index 0000000000000000000000000000000000000000..c19a05e7c39907201636bb2e417fbd64b1236b7a --- /dev/null +++ b/scraped_kb_articles/multiple_xml_data_source-error-while-working-with-xml-data.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/multiple_xml_data_source-error-while-working-with-xml-data", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile working with XML data after migrating to Databricks Runtime 14.3, you receive an error that multiple XML data sources were detected.\nAnalysisException: [MULTIPLE_XML_DATA_SOURCE] Detected multiple data sources with the name xml (com.databricks.spark.xml.DefaultSource, org.apache.spark.sql.execution.datasources.xml.XmlFileFormat)\nCause\nStarting with Databricks Runtime version 14.3, Databricks natively supports XML read and write operations. If an external Spark XML library (such as\nspark_xml_2_12_0_12_0.jar\n) is installed on the cluster, it may conflict with the built-in XML classpath.\nSolution\nRemove the external XML library, such as\nspark_xml_2_12_0_12_0.jar\n, from the cluster.\nFor more information about the XML format in Databricks, please refer to the\nRead and write XML files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/namespace-onload.json b/scraped_kb_articles/namespace-onload.json new file mode 100644 index 0000000000000000000000000000000000000000..34def386035a438c05ba91a12a2b346d62a2a239 --- /dev/null +++ b/scraped_kb_articles/namespace-onload.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/namespace-onload", + "title": "Título do Artigo Desconhecido", + "content": "This article explains how to resolve a package or namespace loading error.\nProblem\nWhen you install and load some libraries in a notebook cell, like:\n%r\r\n\r\nlibrary(BreakoutDetection)\nYou may get a package or namespace error:\nLoading required package: BreakoutDetection:\r\n\r\nError : package or namespace load failed for ‘BreakoutDetection’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):\r\nnamespace ‘rlang’ 0.3.1 is already loaded, but >= 0.3.4 is required\nCause\nWhile a notebook is attached to a cluster, the R namespace cannot be refreshed. When an R package depends on a newer package version, the required package is downloaded but not loaded. When you load the package, you can observe this error.\nSolution\nTo resolve this error, install the required package as a cluster-installed library (\nAWS\n|\nAzure\n|\nGCP\n)." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/nat-gateway-configuration-issues-with-azure-databricks-workspace-using-vnet-injection.json b/scraped_kb_articles/nat-gateway-configuration-issues-with-azure-databricks-workspace-using-vnet-injection.json new file mode 100644 index 0000000000000000000000000000000000000000..4cf3d3f1c11472cbdcc0bcce579a757d1ab6014e --- /dev/null +++ b/scraped_kb_articles/nat-gateway-configuration-issues-with-azure-databricks-workspace-using-vnet-injection.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/nat-gateway-configuration-issues-with-azure-databricks-workspace-using-vnet-injection", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are unable to launch clusters in an Azure VNET injected Databricks workspace when NAT gateway is configured.\nInvalid argument: Cannot launch the cluster because the user specified an invalid argument. Internal error message: NAT Gateway /subscriptions/.../resourceGroups/. .../providers/Microsoft.Network/natGateways/... cannot be deployed on subnet containing Basic SKU Public IP addresses or Basic SKU Load Balancer.\r\n\r\nNIC /subscriptions/.../resourceGroups/.../providers/Microsoft.Network/networkInterfaces/...-publicNIC/ipConfigurations/ipConfig in subnet /subscriptions/.../resourceGroups/npdac-rg/providers/Microsoft.Network/virtualNetworks/npdac-vnet/subnets/... has reference to Basic SKU Public IP address or Load Balancer /subscriptions/.../resourceGroups/.../providers/Microsoft.Network/publicIPAddresses/....\nCause\nSecure cluster connectivity (also known as no public IP or NPIP) is not enabled on the workspace. This results in a conflict as the NAT gateway can not be placed on subnets containing\nBasic SKU public IP addresses\nor Basic SKU load balancer configurations. This is a documented\nNAT gateway limitation\n.\nSolution\nYou should\nEnable secure cluster connectivity\non your workspace. 
After secure cluster connectivity is enabled, the workspace functions normally.\nFor more information, review the\nEgress with VNet injection\ndocumentation.\nIf you cannot use secure cluster connectivity for some reason, you should configure the workspace to use a single IP and align the firewall settings.\nFor more information, review the\nAssign a single public IP for VNet-injected workspaces using Azure Firewall\nKB article." +} \ No newline at end of file diff --git a/scraped_kb_articles/nbconvert-wrong-color-assert.json b/scraped_kb_articles/nbconvert-wrong-color-assert.json new file mode 100644 index 0000000000000000000000000000000000000000..22e55bb8fe8eca2e0e5a9c79841a97b5a32df53f --- /dev/null +++ b/scraped_kb_articles/nbconvert-wrong-color-assert.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/nbconvert-wrong-color-assert", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou run a Python notebook and it fails with an\nAssertionError: wrong color format\nmessage.\nAn example stack trace:\nFile \"/local_disk0/tmp/1599775649524-0/PythonShell.py\", line 39, in \r\n    from IPython.nbconvert.filters.ansi import ansi2html\r\n  File \"\", line 983, in _find_and_load\r\n  File \"\", line 963, in _find_and_load_unlocked\r\n  File \"\", line 902, in _find_spec\r\n  File \"\", line 876, in _find_spec_legacy\r\n  File \"/databricks/python/lib/python3.7/site-packages/IPython/utils/shimmodule.py\", line 36, in find_module\r\n    mod = import_item(mirror_name)\r\n  File \"/databricks/python/lib/python3.7/site-packages/IPython/utils/importstring.py\", line 31, in import_item\r\n    module = __import__(package, fromlist=[obj])\r\n  File \"/databricks/python/lib/python3.7/site-packages/nbconvert/__init__.py\", line 4, in \r\n    from .exporters import *\r\n  File \"/databricks/python/lib/python3.7/site-packages/nbconvert/exporters/__init__.py\", line 4, in \r\n    from .slides import SlidesExporter\r\n  File 
\"/databricks/python/lib/python3.7/site-packages/nbconvert/exporters/slides.py\", line 12, in \r\n    from ..preprocessors.base import Preprocessor\r\n  File \"/databricks/python/lib/python3.7/site-packages/nbconvert/preprocessors/__init__.py\", line 7, in \r\n    from .csshtmlheader import CSSHTMLHeaderPreprocessor\r\n  File \"/databricks/python/lib/python3.7/site-packages/nbconvert/preprocessors/csshtmlheader.py\", line 14, in \r\n    from jupyterlab_pygments import JupyterStyle\r\n  File \"/databricks/python/lib/python3.7/site-packages/jupyterlab_pygments/__init__.py\", line 4, in \r\n    from .style import JupyterStyle\r\n  File \"/databricks/python/lib/python3.7/site-packages/jupyterlab_pygments/style.py\", line 10, in \r\n    class JupyterStyle(Style):\r\n  File \"/databricks/python/lib/python3.7/site-packages/pygments/style.py\", line 101, in __new__\r\n    ndef[0] = colorformat(styledef)\r\n  File \"/databricks/python/lib/python3.7/site-packages/pygments/style.py\", line 58, in colorformat\r\n    assert False, \"wrong color format %r\" % text\r\nAssertionError: wrong color format 'var(--jp-mirror-editor-variable-color)'\nCause\nThis is caused by an incompatible version of the\nnbconvert\nlibrary. If you do not have\nnbconvert\npinned to the correct version, it is possible to accidentally install an incompatible version via PyPI.\nSolution\nManually install\nnbconvert\nversion 6.0.0rc0 on the cluster. This overrides any incorrect version of the library that may have been installed.\nClick the clusters icon in the sidebar.\nClick the cluster name.\nClick the\nLibraries\ntab.\nClick\nInstall New\n.\nIn the Library Source button list, select\nPyPi\n.\nEnter\nnbconvert==6.0.0rc0\nin the\nPackage\nfield.\nClick\nInstall\n." 
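After pinning the library, you can verify from a notebook cell which version is actually active. A small stdlib-only check; the guard function is illustrative, only the package name and version come from the steps above:

```python
from importlib.metadata import PackageNotFoundError, version

def has_pinned(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return False

# After the install steps, has_pinned("nbconvert", "6.0.0rc0") should hold.
print(has_pinned("definitely-not-installed-xyz", "1.0"))  # False
```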
+} \ No newline at end of file diff --git a/scraped_kb_articles/need-to-see-job-creator-when-investigating-a-job-but-can-only-see-service-principal.json b/scraped_kb_articles/need-to-see-job-creator-when-investigating-a-job-but-can-only-see-service-principal.json new file mode 100644 index 0000000000000000000000000000000000000000..388ac93a124e2cdea162fa2f33c3224f7c294de9 --- /dev/null +++ b/scraped_kb_articles/need-to-see-job-creator-when-investigating-a-job-but-can-only-see-service-principal.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/need-to-see-job-creator-when-investigating-a-job-but-can-only-see-service-principal", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to investigate job failures or resource consumption in an automated process, you need to identify the actual user or system that initiated a job, but are only able to see the service principal.\nCause\nService principals are commonly used for automation and identity management in place of a specific user or group identity.\nSolution\nUse the Databricks API to retrieve job creator details under the service principal. 
The following script fetches the job details, including the job creator's information, when provided with a job ID.\nFor more information, review the\nGet a single job\nAPI documentation.\nimport requests\r\nimport json\r\n\r\n# Databricks API information\r\nDATABRICKS_HOST = \"\"  # Replace with your Databricks instance URL\r\nDATABRICKS_TOKEN = \"\"  # Replace with your Databricks access token\r\nJOB_ID = \"\"  # Replace with your Databricks job ID\r\n\r\n# API URL for fetching job details\r\napi_url = f\"{DATABRICKS_HOST}/api/2.1/jobs/get\"\r\n\r\n# Headers with authorization\r\nheaders = {\r\n    \"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\",\r\n    \"Content-Type\": \"application/json\"\r\n}\r\n\r\n# Request payload\r\npayload = {\r\n    \"job_id\": JOB_ID\r\n}\r\n\r\n# Make the API request\r\nresponse = requests.get(api_url, headers=headers, params=payload)\r\n\r\n# Check if the request was successful\r\nif response.status_code == 200:\r\n    job_details = response.json()\r\n    \r\n    # Fetch the creator details from the response\r\n    creator_user_name = job_details.get(\"creator_user_name\", \"Creator not found\")\r\n    \r\n    print(f\"Job ID: {JOB_ID}\")\r\n    print(f\"Job Creator: {creator_user_name}\")\r\nelse:\r\n    print(f\"Failed to fetch job details. 
Status Code: {response.status_code}\")\r\n    print(f\"Error: {response.text}\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/need-to-track-dbu-consumption-per-cluster-and-present-clusters-in-the-workspace.json b/scraped_kb_articles/need-to-track-dbu-consumption-per-cluster-and-present-clusters-in-the-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..547caef8c2e5eaab4070b5f2a2c0fe181804ce04 --- /dev/null +++ b/scraped_kb_articles/need-to-track-dbu-consumption-per-cluster-and-present-clusters-in-the-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/need-to-track-dbu-consumption-per-cluster-and-present-clusters-in-the-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to query the\nsystem.compute.clusters\nand\nsystem.billing.usage\ntables to gain insights into DBU consumption per cluster and identify all-purpose and job clusters within your workspace.\nCause\nYou may want these insights for multiple reasons, including old deleted clusters, potential field variations, and SQL query adjustments.\nDeleted clusters\nClusters deleted before specific dates may not be included in the query results.\nField variations\nDifferences in the\nchange_time\nand\ndelete_time\nfields could sometimes cause cluster information to appear differently than expected.\nSQL query adjustments\nQueries may need to incorporate recent changes or deletions in cluster data to provide more accurate results.\nSolution\nThese steps can be implemented to achieve more refined results by utilizing multiple system tables.\nEnable required system schemas\nWhile the\nsystem.billing\nschema is enabled by default, the\nsystem.compute\nschema needs to be enabled manually.\nFor more information, please review the\nEnable system table schemas\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nRefine SQL queries\nAdjust queries to include recent cluster changes and take deleted clusters 
into account for more accurate results.\nApplying filters and joins using fields like\nchange_time\nand\ndelete_time\ncan help provide a comprehensive view of cluster states.\nEnsure accurate results from the\nsystem.compute.clusters\ntable\nEnsure your SQL query correctly accounts for the latest changes and deletions in cluster data.\nExample code\nIn this example code block, we are fetching the clusters present in the workspace created via UI and API.\nSELECT DISTINCT \r\n    c.workspace_id, \r\n    c.cluster_id, \r\n    c.cluster_name, \r\n    MAX(c.create_time) AS first_created, \r\n    MAX(c.change_time) AS last_changed, \r\n    c.driver_node_type, \r\n    c.worker_node_type, \r\n    c.min_autoscale_workers, \r\n    c.max_autoscale_workers, \r\n    c.auto_termination_minutes, \r\n    c.dbr_version, \r\n    c.cluster_source\r\nFROM \r\n    system.compute.clusters c\r\nINNER JOIN (\r\n    SELECT \r\n        cluster_id, \r\n        MAX(change_time) AS max_change_time \r\n    FROM \r\n        system.compute.clusters \r\n    GROUP BY \r\n        cluster_id\r\n) m \r\n    ON c.cluster_id = m.cluster_id \r\n    AND c.change_time = m.max_change_time\r\nWHERE \r\n    c.delete_time IS NULL \r\n    AND c.cluster_source IN ('UI', 'API') \r\n    AND c.workspace_id = \r\nGROUP BY \r\n    c.workspace_id, \r\n    c.cluster_id, \r\n    c.cluster_name, \r\n    c.driver_node_type, \r\n    c.worker_node_type, \r\n    c.min_autoscale_workers, \r\n    c.max_autoscale_workers, \r\n    c.auto_termination_minutes, \r\n    c.dbr_version, \r\n    c.cluster_source\r\nORDER BY \r\n    c.workspace_id, \r\n    c.cluster_id;\nRetrieve DBU consumption per cluster.\nExample code\nIn this example code block, we are fetching the DBUs consumed per cluster on a monthly basis.\nSELECT \r\n    date_format(u.usage_date, 'yyyy-MM') AS `Month`, \r\n    c.cluster_name AS `Cluster Name`, \r\n    c.cluster_id AS `Cluster ID`, \r\n    SUM(u.usage_quantity) AS `Total DBUs Consumed`\r\nFROM \r\n    
system.billing.usage u\r\nINNER JOIN \r\n    system.compute.clusters c \r\n    ON u.usage_metadata.cluster_id = c.cluster_id\r\nWHERE \r\n    c.cluster_name LIKE \r\n    AND c.cluster_id LIKE \r\nGROUP BY \r\n    date_format(u.usage_date, 'yyyy-MM'), \r\n    c.cluster_name, \r\n    c.cluster_id\r\nORDER BY \r\n    `Month` ASC, \r\n    `Cluster Name` ASC;\nVerify that the\ndelete_time\nfield is correctly handled in your queries to exclude deleted clusters.\nSELECT * FROM system.compute.clusters WHERE delete_time IS NULL;\nVerify the sample queries with known data to ensure they return the expected results. If needed, customize the examples based on your specific workspace and cluster configurations.\nFor more information regarding the\nsystem.compute.clusters\ntable, please review the\nCompute system tables reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/new-columns-added-to-a-table-are-not-reflecting-in-a-view.json b/scraped_kb_articles/new-columns-added-to-a-table-are-not-reflecting-in-a-view.json new file mode 100644 index 0000000000000000000000000000000000000000..a5773b998e4b7e3249e838591f3b54372086faea --- /dev/null +++ b/scraped_kb_articles/new-columns-added-to-a-table-are-not-reflecting-in-a-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/new-columns-added-to-a-table-are-not-reflecting-in-a-view", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou create a view on an underlying table with the intent of keeping the view columns synchronized with the table columns. You notice new columns added to the table are not immediately available and accessible in the view.\nExample\nAfter creating a table and view, when you select from the table, it returns both columns. 
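The monthly DBU query above reduces to a join plus a group-by: map each usage row to its cluster name, then sum `usage_quantity` per (month, cluster). A plain-Python sketch of that aggregation logic, with made-up records standing in for the two system tables:

```python
from collections import defaultdict

# Made-up sample data standing in for system.compute.clusters and
# system.billing.usage.
clusters = {"c-001": "etl-cluster", "c-002": "adhoc-cluster"}
usage = [
    {"cluster_id": "c-001", "usage_date": "2024-05-03", "usage_quantity": 2.5},
    {"cluster_id": "c-001", "usage_date": "2024-05-20", "usage_quantity": 1.5},
    {"cluster_id": "c-002", "usage_date": "2024-06-01", "usage_quantity": 4.0},
]

totals = defaultdict(float)
for row in usage:
    month = row["usage_date"][:7]                   # date_format(..., 'yyyy-MM')
    name = clusters[row["cluster_id"]]              # the INNER JOIN on cluster_id
    totals[(month, name)] += row["usage_quantity"]  # SUM(...) GROUP BY

print(dict(totals))
# {('2024-05', 'etl-cluster'): 4.0, ('2024-06', 'adhoc-cluster'): 4.0}
```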
When you select from the view, it only returns one column.\n%sql\r\n-- create a table and view\r\nCREATE OR REPLACE TABLE my_table (c1 INT);\r\nCREATE OR REPLACE VIEW my_view AS SELECT * FROM my_table;\r\n\r\n-- add a column to the table\r\nALTER TABLE my_table ADD COLUMNS (c2 INT);\r\n\r\n-- select from the table\r\nSELECT * FROM my_table; -- ➡️ returns both `c1` and `c2` columns\r\n\r\n-- select from the view\r\nSELECT * FROM my_view; -- ➡️ only returns column `c1`\nCause\nA view created using\nselect *\nstores the column names and data types of the underlying table at the time of view creation. Any subsequent changes to the table structure, such as adding new columns, are not automatically reflected in the view because view definition is not dynamically linked to the underlying table structure.\nSolution\nThere are two approaches available.\nRebuild the existing view.\nRecreate the view using\nWITH SCHEMA EVOLUTION\n.\nRebuild the existing view\nUse the following code to initially replace an existing view. If the view has existing permissions or grants, they need to be reapplied.\nCREATE OR REPLACE VIEW AS SELECT * FROM ;\nAlternatively, if your tools allow it, use the following code to rebuild the view with\nALTER VIEW\nto keep existing permissions or grants intact.\nALTER VIEW AS SELECT * FROM ;\nRecreate the view using WITH SCHEMA EVOLUTION\nIf you use Databricks Runtime 15.3 or above, you can recreate the view with the\nWITH SCHEMA EVOLUTION\nschema binding syntax.\nWITH SCHEMA EVOLUTION\nallows the view to adapt to changes in the schema of the query due to changes in the underlying object definitions.\nIn the following code, the example from the problem statement is modified to include\nWITH SCHEMA EVOLUTION\n. 
After applying, both columns are returned when you select from the view.\n-- create a table and view\r\nCREATE OR REPLACE TABLE my_table (c1 INT);\r\nCREATE OR REPLACE VIEW my_view WITH SCHEMA EVOLUTION AS SELECT * FROM my_table;\r\n\r\n\r\n-- add a column to the table\r\nALTER TABLE my_table ADD COLUMNS (c2 INT);\r\n\r\n-- select from the table\r\nSELECT * FROM my_table; -- ➡️ returns both `c1` and `c2` columns\r\n\r\n-- select from the view\r\nSELECT * FROM my_view; -- ➡️ returns both `c1` and `c2` columns\nFor more information, refer to the\nCREATE VIEW\n(\nAWS\n|\nAzure\n|\nGCP\n)  and\nALTER VIEW\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/no-module-named-error-for-dependent-libraries-within-a-job-task.json b/scraped_kb_articles/no-module-named-error-for-dependent-libraries-within-a-job-task.json new file mode 100644 index 0000000000000000000000000000000000000000..5e154771e10144c8e4ea425100ed3541403dae92 --- /dev/null +++ b/scraped_kb_articles/no-module-named-error-for-dependent-libraries-within-a-job-task.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/no-module-named-error-for-dependent-libraries-within-a-job-task", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen defining libraries for workflows using job clusters, you receive an error below despite the library being present.\nModuleNotFoundError: No module named ''\nCause\nThe library is present, but it is defined in another task in the same job through the\nDependent libraries\nfield. 
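The underlying behavior can be pictured as a column-list snapshot. A toy Python illustration (not Databricks API, just the snapshot-versus-live-link idea described above):

```python
# A view created without schema evolution stores the column list as it was at
# creation time: a snapshot, not a live link to the table definition.
table_cols = ["c1"]
view_cols = list(table_cols)          # CREATE VIEW ... AS SELECT *  (snapshot)

table_cols.append("c2")               # ALTER TABLE ... ADD COLUMNS (c2 INT)
print(view_cols)                      # ['c1'], the new column is invisible

rebuilt_view_cols = list(table_cols)  # CREATE OR REPLACE VIEW: fresh snapshot
print(rebuilt_view_cols)              # ['c1', 'c2']
```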
Tasks don’t inherit libraries from other tasks.\nSolution\nIf you use all-purpose compute, install libraries at the compute level rather than the job level, so that the libraries can be shared across all tasks automatically.\nFor non-serverless job clusters, you can share library dependencies between tasks in the same job through the UI.\nOpen the first parent task.\nGo to the\nDependent libraries\nfield and include all library dependencies.\nMove to the child/dependent tasks.\nFill in the parent task in the\nDepends on\nfield.\nThe dependent tasks now inherit all libraries set on the parent task.\nOther options\nUse notebook-scoped libraries. For details, review the\nLibraries\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nUse Terraform, where the dependencies can be defined as variables. For details, review the\nCreate clusters, notebooks, and jobs with Terraform\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nUse Databricks Asset Bundles (DAB), which allow you to reuse the dependencies between job tasks. For details, review the\nDatabricks Asset Bundles library dependencies\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you use serverless compute, you can use notebook-scoped libraries or use the\nEnvironment and Libraries\nfield to select, edit, or add a new environment. Note this will be required in every task. For more information, review the\nInstall Notebook dependencies\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/no-such-catalog-exception-error-when-trying-to-create-row-filters.json b/scraped_kb_articles/no-such-catalog-exception-error-when-trying-to-create-row-filters.json new file mode 100644 index 0000000000000000000000000000000000000000..30888c67778943ca8ea349db0448acff95f65833 --- /dev/null +++ b/scraped_kb_articles/no-such-catalog-exception-error-when-trying-to-create-row-filters.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/no-such-catalog-exception-error-when-trying-to-create-row-filters", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to create a row filter with a fully qualified table name (for example,\nTest_catalog.Test_schema.Test_table_name\n) in a Databricks environment, when you encounter a\nNO SUCH CATALOG EXCEPTION\nerror message. This error message suggests that the specified catalog does not exist, even though it does.\nThis issue occurs specifically when you are trying to apply a row filter using the\nALTER TABLE\ncommand with a fully qualified table name. However, if you set the catalog and schema with the\nuse\ncommand before running the\nALTER TABLE\ncommand, the error does not occur.\nCause\nThe\nALTER TABLE\ncommand expects the function location to be specified when it is applied to the table. If the function location is not specified, Databricks searches for the function in the default\nhive_metastore\n. If the default\nhive_metastore\ndoes not contain the required function it leads to a\nNO SUCH CATALOG EXCEPTION\nerror.\nSolution\nSpecify the function location when you use the\nALTER TABLE\ncommand to apply a row filter on a table.\nExample syntax\n%sql\r\n\r\nALTER TABLE .. SET ROW FILTER .. 
ON ();\nUsing the example table name, it might look like this:\n%sql\r\n\r\nALTER TABLE Test_catalog.Test_schema.Test_table_name\r\nSET ROW FILTER Test_catalog.Test_schema.Test_fun ON (Test_col_name);\nFor more information, review the\nROW FILTER clause\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/no-use-perms-db.json b/scraped_kb_articles/no-use-perms-db.json new file mode 100644 index 0000000000000000000000000000000000000000..7ae0bcf14652716b1d4288dfed2454d439094c94 --- /dev/null +++ b/scraped_kb_articles/no-use-perms-db.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/no-use-perms-db", + "title": "Unknown Article Title", + "content": "Problem\nYou are using a cluster running Databricks Runtime 7.3 LTS and above.\nYou have enabled table access control for your workspace (\nAWS\n|\nAzure\n|\nGCP\n) as the admin user, and granted the\nSELECT\nprivilege to a standard user-group that needs to access the tables.\nA user tries to access an object in the database and gets a SecurityException error message.\nError in SQL statement: SecurityException: User does not have permission USAGE on database \nCause\nA new\nUSAGE\nprivilege was added to the available data access privileges. This privilege is enforced on clusters running Databricks Runtime 7.3 LTS and above.\nSolution\nGrant the\nUSAGE\nprivilege to the user-group.\nLog in to the workspace as an admin user.\nOpen a notebook.\nRun the following command:\n%sql\r\n\r\nGRANT USAGE ON DATABASE TO ;\nReview the USAGE privilege (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information."
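The grant step above can be scripted when several privileges need to be applied together. The sketch below only builds the SQL strings; the database and group names are hypothetical placeholders, and on a real cluster each string would be passed to `spark.sql(...)` from an admin notebook.

```python
# Sketch: USAGE must be granted alongside SELECT on Databricks Runtime
# 7.3 LTS and above, so build both statements in one pass.
database = "sales_db"        # hypothetical database name
group = "`data-readers`"     # hypothetical user-group
statements = [
    f"GRANT {privilege} ON DATABASE {database} TO {group}"
    for privilege in ("USAGE", "SELECT")
]
for statement in statements:
    print(statement)  # on a cluster: spark.sql(statement)
```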
+} \ No newline at end of file diff --git a/scraped_kb_articles/no-way-to-restore-dropped-managed-volumes.json b/scraped_kb_articles/no-way-to-restore-dropped-managed-volumes.json new file mode 100644 index 0000000000000000000000000000000000000000..967bde5229e3c055a229d1e10a7675fbee0a4915 --- /dev/null +++ b/scraped_kb_articles/no-way-to-restore-dropped-managed-volumes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/no-way-to-restore-dropped-managed-volumes", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen performing actions such as moving resources between schemas, you may face a risk of not being able to restore an inadvertently dropped managed volume.\nNote\nData retention policies are distinct from issues around dropped volumes.\nCause\nRestoring a managed volume requires altering the backend database, which is not permitted.\nSolution\nImportant!\nBecause there is no direct method to restore a dropped managed volume, please use extra caution before performing actions.\nBack up a copy of your data to a safe, separate location before starting.\nIf a managed volume is dropped, create a new managed volume in the desired schema.\nUse shell commands or other data transfer methods to copy the data from the backup location to the new managed volume.\nThe best approach is to mitigate the risk generally by always using external volumes for operations that require frequent schema changes.\nFor more information, please review the\nVolumes\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/nodes_lost-error-during-cluster-upsizing-when-apache-spark-dynamic-allocation-is-enabled.json b/scraped_kb_articles/nodes_lost-error-during-cluster-upsizing-when-apache-spark-dynamic-allocation-is-enabled.json new file mode 100644 index 0000000000000000000000000000000000000000..9cf0cbd25d6b7bec66ad90fbc421808a7d640b59 --- /dev/null +++ b/scraped_kb_articles/nodes_lost-error-during-cluster-upsizing-when-apache-spark-dynamic-allocation-is-enabled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/nodes_lost-error-during-cluster-upsizing-when-apache-spark-dynamic-allocation-is-enabled", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAfter enabling the Apache Spark dynamic allocation configuration on a cluster, you encounter a\nNODES_LOST\nerror when upsizing the cluster. The error message typically appears as the following.\nMessage: Compute lost at least one node. Reason: Communication lost Help: Communication with at least one worker node was unexpectedly lost. This issue can occur because of instance malfunction or network unavailability. Please retry and contact Databricks if the problem persists.\nCause\nThe\nNODES_LOST\nerror indicates communication loss with one or more worker nodes.\nEnabling the Spark dynamic allocation configuration can conflict with Databricks' autoscaling mechanism, leading to node loss. Databricks clusters are managed by Databricks autoscaling, so the Spark dynamic allocation should not be configured.\nSolution\nNavigate to your cluster and click to open the settings.\nScroll down to\nAdvanced options\nand click to expand.\nUnder the\nSpark\ntab, find the\nspark.dynamicAllocation.enabled true\nconfiguration and remove it.\nFor more information regarding Spark configuration, review the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/non-admin-users-unable-to-access-azure-databricks-job-logs-for-jobs-triggered-by-azure-data-factory-adf.json b/scraped_kb_articles/non-admin-users-unable-to-access-azure-databricks-job-logs-for-jobs-triggered-by-azure-data-factory-adf.json new file mode 100644 index 0000000000000000000000000000000000000000..ed8225af284d2e0b932ad83682fccab847492f5a --- /dev/null +++ b/scraped_kb_articles/non-admin-users-unable-to-access-azure-databricks-job-logs-for-jobs-triggered-by-azure-data-factory-adf.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/non-admin-users-unable-to-access-azure-databricks-job-logs-for-jobs-triggered-by-azure-data-factory-adf", + "title": "Unknown Article Title", + "content": "Problem\nNon-admin users are unable to view job logs for jobs triggered by Azure Data Factory (ADF) unless they have admin permissions, preventing them from troubleshooting or understanding a job’s execution.\nWhen a non-admin user tries to access the job run, they see the following message on their job run page.\nJob run not available or not found\r\nThe URL may be misspelled or the run you're looking for is no longer available.\r\nDatabricks keeps job run results and compute logs for 60 days after termination. For runs that have not expired yet, you can export job run results. To save a copy of the compute logs for future runs, enable compute log delivery to have the logs delivered to cloud storage\r\nParent job of this run: No job found for this run | Jobs Page\nCause\nBy default, ADF-triggered job runs inherit permissions from the job’s notebook.
This behavior is by design, to ensure only authorized users can view job logs and related information.\nSolution\nYou can allow non-admin users to access Azure Databricks job logs without granting full admin privileges, so they can debug or conduct analysis.\nAs an admin, follow these steps on behalf of the non-admin user:\nNavigate to your Azure Databricks workspace and locate the job's notebook.\nClick on the\nShare\nbutton on the top right corner of the notebook.\nIn the\nShare Notebook\ndialog box, search for the user or group that needs access to the job logs.\nSelect the user or group and grant them the\nCan View\npermission." +} \ No newline at end of file diff --git a/scraped_kb_articles/non-admin-users-with-can-manage-permissions-on-a-specific-dlt-cluster-can-see-the-spark-ui-but-cannot-view-the-driver-logs.json b/scraped_kb_articles/non-admin-users-with-can-manage-permissions-on-a-specific-dlt-cluster-can-see-the-spark-ui-but-cannot-view-the-driver-logs.json new file mode 100644 index 0000000000000000000000000000000000000000..e1c4c4d703f2a4ba707ef0e67c5dc4dbf6991348 --- /dev/null +++ b/scraped_kb_articles/non-admin-users-with-can-manage-permissions-on-a-specific-dlt-cluster-can-see-the-spark-ui-but-cannot-view-the-driver-logs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/non-admin-users-with-can-manage-permissions-on-a-specific-dlt-cluster-can-see-the-spark-ui-but-cannot-view-the-driver-logs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are a non-admin user with\n'CAN MANAGE'\npermissions on a specific Delta Live Tables (DLT) pipeline’s cluster. 
You want to view the driver logs, but they don’t appear in the Apache Spark UI.\nCause\nBy default, only the DLT pipeline owner and workspace admins have permission to view the cluster driver logs.\nSolution\nSet\nspark.databricks.acl.needAdminPermissionToViewLogs\nto\nfalse\nto enable non-admin users to view driver logs.\nUsing the REST API\nUse the Databricks REST API to edit the pipeline to include the configuration. Add the following configuration under the\n‘configuration’\nfield in the API call. For details, review the\nEdit a pipeline\nAPI documentation.\n{\r\n \"configuration\": {\r\n  \"spark.databricks.acl.needAdminPermissionToViewLogs\": \"false\"\r\n }\r\n}\nNote\nThe API is of\nPUT\ntype. Please make sure to add ALL other existing configs while editing the pipeline.\nUsing the Databricks UI\nNavigate to your Databricks workspace.\nGo to the\nSettings\ntab of the DLT pipeline.\nScroll down to\nAdvanced options\nand add the following configuration.\nspark.databricks.acl.needAdminPermissionToViewLogs false\nSave the changes and try again to view the driver logs." +} \ No newline at end of file diff --git a/scraped_kb_articles/nosuchmethoderror-import-failure-of-protobuf-java-when-trying-to-run-apache-spark-workflow.json b/scraped_kb_articles/nosuchmethoderror-import-failure-of-protobuf-java-when-trying-to-run-apache-spark-workflow.json new file mode 100644 index 0000000000000000000000000000000000000000..0ceb51f1a04e2e1934a82c9eed1624e8ef18ea32 --- /dev/null +++ b/scraped_kb_articles/nosuchmethoderror-import-failure-of-protobuf-java-when-trying-to-run-apache-spark-workflow.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/nosuchmethoderror-import-failure-of-protobuf-java-when-trying-to-run-apache-spark-workflow", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen running an Apache Spark workflow in Databricks, you encounter an import failure related to\nProtobuf-java\n. 
You see the following error while checking the logs.\njava.lang.NoSuchMethodError: com.google.protobuf.Internal.checkNotNull(Ljava/lang/Object;)Ljava/lang/Object;    at com.google.protobuf.SingleFieldBuilderV3.(SingleFieldBuilderV3.java:57)\nCause\nThere is a Protobuf library version mismatch between Databricks Runtime and your JAR files.\nDatabricks includes a specific version of\nProtobuf-java\nin its Runtime. If you import a JAR file that depends on a different Protobuf version, it can cause conflicts, leading to import errors.\nSolution\nShade the Protobuf class in the JAR. Shading ensures that the required version of Protobuf is packaged within the JAR and does not conflict with Databricks Runtime’s built-in version.\nWhat Does It Mean to \"Shade a JAR\"?\nShading a JAR is a technique used in Java development to relocate and package dependencies inside a JAR file to avoid conflicts with other libraries or the runtime environment. It makes sure the included dependencies do not interfere with versions already present in the classpath.\nFor more information and instructions, refer to the\nDeploy Scala JARs on Unity Catalog clusters\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/not-able-to-see-all-rows-from-systeminformation_schema-tables.json b/scraped_kb_articles/not-able-to-see-all-rows-from-systeminformation_schema-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..b5c25f367ab49407d7776f3318f7727c9c0ec4ef --- /dev/null +++ b/scraped_kb_articles/not-able-to-see-all-rows-from-systeminformation_schema-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/not-able-to-see-all-rows-from-systeminformation_schema-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are not able to see all rows from\nsystem.information_schema\ntables even though you have\nSELECT\npermission on the table.\nCause\nYou can only see the rows from the catalogs you have access to.\nFor example, if there are ten catalogs in your workspace, but you have access to only four of the ten, then you can view data related to the four catalogs in\ninformation_schema\n.\nRefer to the\nTABLES\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details.\nSolution\nAsk your metastore admin or the\nsecurable object\nowner to assign\nBROWSE\npermission on the UC object to you, so you can view related information in the\ninformation_schema\ntable.\nIf you need to query data across all catalogs or schemas, instead of assigning\nBROWSE\npermissions on each UC object, ask your metastore admin to assign\nMETASTORE_ADMIN\nprivileges." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/not-enough-disk-space-error-when-downloading-a-model-from-hugging-face.json b/scraped_kb_articles/not-enough-disk-space-error-when-downloading-a-model-from-hugging-face.json new file mode 100644 index 0000000000000000000000000000000000000000..71ef566fc71ada36a87bef83d539cdbebb968214 --- /dev/null +++ b/scraped_kb_articles/not-enough-disk-space-error-when-downloading-a-model-from-hugging-face.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/not-enough-disk-space-error-when-downloading-a-model-from-hugging-face", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are setting up a machine learning model from Hugging Face, but the download fails with an error message that says there is no more disk space.\n/databricks/python/lib/python3.11/site-packages/huggingface_hub/file_download.py:1006: UserWarning: Not enough free disk space to download the file. The expected file size is: XXXX MB. The target location /root/.cache/huggingface/hub only has XXX MB free disk space.\nCause\nThe root partition on machines is fixed and does not autoscale. If you download anything to a home directory or temporary folder on the root partition, it quickly runs out of storage space.\nSolution\nYou can set\nos.environ['HF_HUB_CACHE'] = \"\"\nto a local path with available space. Set this environment variable early, before importing transformers libs.\nAutoscaling local storage is under\n/local_disk0\n. That is the recommended choice for local storage.\nIf you want persistent storage, you can also use\n/dbfs\nor\n/Volumes\n. This is remote storage, so it may be slower to write and read as compared to using local storage." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/not-found-404-errors-when-selecting-system-tables.json b/scraped_kb_articles/not-found-404-errors-when-selecting-system-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..18c167f6008f0248f7d8404f2604e16daac181a1 --- /dev/null +++ b/scraped_kb_articles/not-found-404-errors-when-selecting-system-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/not-found-404-errors-when-selecting-system-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using custom DNS in your Databricks data plane environment (VNet Azure and VPC for AWS) and are trying to select information from system tables when you get a 404 error message. You may also have seen unknown host exceptions or connection timeout errors.\nExample error\nFileReadException: Error while reading file delta-sharing:/…. Caused by: IOException: java.util.concurrent.ExecutionException: io.delta.sharing.spark.util.UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 404 The specified resource does not exist. {\"error\":{\"code\":\"ResourceNotFound\",\"message\":\"The specified resource does not exist.\\nRequestId:00000000-0000-0000-0000-000000000000\\nTime:0000-00-00T00:00:00.00000000\"}}, while accessing URI of shared table file Caused by: ExecutionException: io.delta.sharing.spark.util.UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 404 The specified resource does not exist.\nInfo\nUnknown host exceptions or connection timeouts can occur if the rotated IP address is not assigned to any storage endpoint when it fails. The 404 error message occurs when the previous IP address is rotated to a storage endpoint on a different region.\nCause\nYour custom DNS was mapped to a static IP address for Databricks artifact storage that was also used by our system tables. 
Databricks public IP addresses can rotate within the range specified, depending on the cloud provider.\nReview\nIP addresses and domains for Databricks services and assets\n(\nAWS\n|\nAzure\n) for your cloud provider and region.\nSolution\nThe short term workaround is to update your custom DNS to the current IP address or CNAME of the system tables endpoint of your region. You can get this IP address with a serverless cluster, or any other machine connected to a public DNS server using a nslookup command.\nYou need to modify the domain depending on your cloud and region used.\nAWS:\nnslookup system-tables-prod--uc-metastore-bucket.s3..amazonaws.com\nAzure:\nnslookup .dfs.core.windows.net\nDatabricks recommends using a DNS conditional forwarder to reference the updated domain instead of using static IP addresses with custom DNS." +} \ No newline at end of file diff --git a/scraped_kb_articles/not-receiving-emails-from-dbsql-alerts.json b/scraped_kb_articles/not-receiving-emails-from-dbsql-alerts.json new file mode 100644 index 0000000000000000000000000000000000000000..20736e96810b3225eec4062d7a00ece2aea30432 --- /dev/null +++ b/scraped_kb_articles/not-receiving-emails-from-dbsql-alerts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/not-receiving-emails-from-dbsql-alerts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are not receiving emails configured to come from Databricks SQL alerts.\nCause\nThe email address for Databricks SQL alerts is blocked in your organization.\nIn rare cases, not receiving alert emails occurs due to Databricks internal email block list as well.\nSolution\nFirst, check your work environment.\nCheck your trash folder to see if there is an auto-deletion set for such emails, causing them to go directly to the trash folder.\nEnsure in your workspace settings you have all email notifications turned ON. 
Navigate to\n/settings/user/notifications\nto check.\nAlso check settings at the account level.\nIf you don’t have an auto-deletion filter set and your workspace and account settings include email notifications ON, the SMTP server on your organization’s side is blocking these emails.\nAsk your IT team to help check this on their end.\nYour IT team may also be able to help you check if you have any other block list for your email where\nnoreply@databricks.com\nis added.\nIf your IT team is unable to resolve the issue:\nVerify when you last received an email from\nnoreply@databricks.com\n.\nContact Databricks support to check whether the email is part of the internal block list." +} \ No newline at end of file diff --git a/scraped_kb_articles/notebook-autosave.json b/scraped_kb_articles/notebook-autosave.json new file mode 100644 index 0000000000000000000000000000000000000000..315ba5ea113672fb6524a241302ab7c431129853 --- /dev/null +++ b/scraped_kb_articles/notebook-autosave.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/notebook-autosave", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nNotebook autosaving fails with the following error message:\nFailed to save revision: Notebook size exceeds limit. This is most commonly caused by cells with large results. Remove some cells or split the notebook.\nCause\nThe maximum notebook size allowed for autosaving is 8 MB.\nSolution\nFirst, check the size of your notebook file using your browser’s developer tools.\nIn Chrome, for example, click\nView\n>\nDeveloper\n>\nDeveloper Tools\n. Click the\nNetwork\ntab and view the\nSize\ncolumn for the notebook file.\nThen, there are two possible solutions:\nYou can manually save notebooks up to 32 MB.\nYou can reduce the size of your notebook by hiding large results.\nGraphing tools like\nplotly\nand\nmatplotlib\ncan generate large sets of results that display as large images. 
You can reduce the notebook size by hiding these large results and images." +} \ No newline at end of file diff --git a/scraped_kb_articles/notebook-cells-fail-to-run-with-failure-starting-repl-and-pandas-check_dependencies-errors.json b/scraped_kb_articles/notebook-cells-fail-to-run-with-failure-starting-repl-and-pandas-check_dependencies-errors.json new file mode 100644 index 0000000000000000000000000000000000000000..42c22ed47c7b81154862ea19c148c256a2caca91 --- /dev/null +++ b/scraped_kb_articles/notebook-cells-fail-to-run-with-failure-starting-repl-and-pandas-check_dependencies-errors.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/notebook-cells-fail-to-run-with-failure-starting-repl-and-pandas-check_dependencies-errors", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nTrying to run any notebook cell returns a\nFailure starting repl.\nerror message.\nFailure starting repl. Try detaching and re-attaching the notebook.\nWhen you review the stack trace it identifies a problem with\nipykernel\nas well as highlighting issues with Pandas\ncheck_dependencies\nand\nrequire_minimum_pandas_version\n.\nExample stack trace\njava.lang.Exception: Unable to start python kernel for ReplId-XXXXX-XXXXX-XXXXX-X, kernel exited with exit code 1.\r\n ----- stdout -----\r\n ------------------\r\n ----- stderr -----\r\n Traceback (most recent call last):\r\n File "/databricks/python_shell/scripts/db_ipykernel_launcher.py", line 19, in <module>\r\n from dbruntime.DatasetInfo import UserNamespaceCommandHook, UserNamespaceDict\r\n File "/databricks/python_shell/dbruntime/DatasetInfo.py", line 8, in <module>\r\n from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame\r\n File "/databricks/spark/python/pyspark/sql/connect/dataframe.py", line 24, in <module>\r\n check_dependencies(__name__)\r\n File "/databricks/spark/python/pyspark/sql/connect/utils.py", line 34, in check_dependencies\r\n require_minimum_pandas_version()\r\n 
File "/databricks/spark/python/pyspark/sql/pandas/utils.py", line 29, in require_minimum_pandas_version\r\n import pandas\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/__init__.py", line 22, in <module>\r\n from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/compat/__init__.py", line 18, in <module>\r\n from pandas.compat.numpy import (\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>\r\n from pandas.util.version import Version\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/util/__init__.py", line 2, in <module>\r\n from pandas.util._decorators import ( # noqa:F401\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/util/_decorators.py", line 14, in <module>\r\n from pandas._libs.properties import cache_readonly\r\n File "/databricks/python/lib/python3.10/site-packages/pandas/_libs/__init__.py", line 13, in <module>\r\n from pandas._libs.interval import Interval\r\n File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval\r\n ValueError: numpy.dtype size changed, may indicate binary incompatibility. 
Expected 96 from C header, got 88 from PyObject\r\n ------------------\r\n \tat com.databricks.backend.daemon.driver.IpykernelUtils$.startReplFailure$1(JupyterDriverLocal.scala:1609)\r\n \tat com.databricks.backend.daemon.driver.IpykernelUtils$.$anonfun$startIpyKernel$3(JupyterDriverLocal.scala:1619)\r\n \tat com.databricks.backend.common.util.TimeUtils$.$anonfun$retryWithExponentialBackoff0$1(TimeUtils.scala:191)\r\n \tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\r\n \tat scala.util.Try$.apply(Try.scala:213)\r\n \tat com.databricks.backend.common.util.TimeUtils$.retryWithExponentialBackoff0(TimeUtils.scala:191)\r\n \tat com.databricks.backend.common.util.TimeUtils$.retryWithExponentialBackoff(TimeUtils.scala:145)\r\n \tat com.databricks.backend.common.util.TimeUtils$.retryWithTimeout(TimeUtils.scala:94)\r\n \tat com.databricks.backend.daemon.driver.IpykernelUtils$.startIpyKernel(JupyterDriverLocal.scala:1617)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal.$anonfun$startPython$1(JupyterDriverLocal.scala:1314)\r\n \tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\r\n \tat scala.util.Try$.apply(Try.scala:213)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal.com$databricks$backend$daemon$driver$JupyterDriverLocal$$withRetry(JupyterDriverLocal.scala:1237)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal$$anonfun$com$databricks$backend$daemon$driver$JupyterDriverLocal$$withRetry$1.applyOrElse(JupyterDriverLocal.scala:1240)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal$$anonfun$com$databricks$backend$daemon$driver$JupyterDriverLocal$$withRetry$1.applyOrElse(JupyterDriverLocal.scala:1237)\r\n \tat scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)\r\n \tat scala.util.Failure.recover(Try.scala:234)\r\n \tat 
com.databricks.backend.daemon.driver.JupyterDriverLocal.com$databricks$backend$daemon$driver$JupyterDriverLocal$$withRetry(JupyterDriverLocal.scala:1237)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal.startPython(JupyterDriverLocal.scala:1262)\r\n \tat com.databricks.backend.daemon.driver.PythonDriverLocalBase.$anonfun$startPythonThreadSafe$1(PythonDriverLocalBase.scala:689)\r\n \tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\r\n \tat com.databricks.backend.daemon.driver.MutuallyExclusiveSections.apply(PythonDriverLocalBase.scala:82)\r\n \tat com.databricks.backend.daemon.driver.PythonDriverLocalBase.startPythonThreadSafe(PythonDriverLocalBase.scala:689)\r\n \tat com.databricks.backend.daemon.driver.JupyterDriverLocal.<init>(JupyterDriverLocal.scala:623)\r\n \tat com.databricks.backend.daemon.driver.PythonDriverWrapper.instantiateDriver(DriverWrapper.scala:950)\r\n \tat com.databricks.backend.daemon.driver.DriverWrapper.setupRepl(DriverWrapper.scala:409)\r\n \tat com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:289)\r\n \tat java.lang.Thread.run(Thread.java:750)\nCause\nThe\nFailure starting repl.\nerror can occur when there is an incompatible version of NumPy and/or Pandas installed on a Databricks cluster.\nNote\nYou should verify that the stack trace references version issues with NumPy and/or Pandas as the\nFailure starting repl.\nerror message can have additional causes.\nIf a Python package depends on a specific NumPy and/or Pandas version, and those packages are updated to an incompatible version, an error may occur and your jobs will fail with Python errors. This can happen after a new release of NumPy and/or Pandas.\nSolution\nEnsure you are using versions of NumPy and/or Pandas that are compatible with your selected Databricks Runtime version. 
The default versions of these (and other libraries) are detailed in the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n,\nAzure\n,\nGCP\n) documentation.\nIf you need a specific version of NumPy and/or Pandas, you should pin the version using an\ninit script\n(\nAWS\n,\nAzure\n,\nGCP\n) or by installing the specific version as a\ncluster library\n(\nAWS\n,\nAzure\n,\nGCP\n).\nFor example, if you want to pin NumPy 1.26.4 and Pandas 2.2.2 in an init script, you should include the following line:\npip install numpy==1.26.4 pandas==2.2.2\nIf you want to install NumPy 1.26.4 and Pandas 2.2.2 as cluster libraries, you should specify the versions when installing them via the API or workspace UI.\nnumpy==1.26.4\npandas==2.2.2" +} \ No newline at end of file diff --git a/scraped_kb_articles/notebook-errors-when-workspaces-are-not-unity-catalog-enabled.json b/scraped_kb_articles/notebook-errors-when-workspaces-are-not-unity-catalog-enabled.json new file mode 100644 index 0000000000000000000000000000000000000000..9f916f7a859feadcfdadce0d63eff5b1a9922fe1 --- /dev/null +++ b/scraped_kb_articles/notebook-errors-when-workspaces-are-not-unity-catalog-enabled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/notebook-errors-when-workspaces-are-not-unity-catalog-enabled", + "title": "Título do Artigo Desconhecido", + "content": "Problem:\nUsers are experiencing errors while running notebooks in different workspaces\nSolution:\nTo resolve this issue, follow these steps:\n1. Check if the issue occurs on clusters where data_security_mode is set as 'SINGLE_USER'. If the workspace is not UC enabled, 'data_security_mode' should be 'LEGACY_SINGLE_USER_STANDARD'.\n2. Verify the JSON for the clusters where the issue occurs to see if the cluster policy is forcing the use of 'SINGLE_USER' mode.\n3. Check if the user is using a personal compute/policy that is forcing 'defaultValue': 'SINGLE_USER' instead of 'LEGACY_SINGLE_USER_STANDARD'." 
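The three-step check in the non-Unity-Catalog article above can be sketched as a small helper. Field names follow the Clusters API JSON (`data_security_mode`); the cluster object is a hypothetical example, and this is a diagnostic sketch rather than an official validation routine.

```python
# For a workspace that is NOT UC-enabled, clusters pinned to SINGLE_USER
# (often by a personal-compute policy defaultValue) should instead use
# LEGACY_SINGLE_USER_STANDARD.
def expected_mode(uc_enabled: bool) -> str:
    return "SINGLE_USER" if uc_enabled else "LEGACY_SINGLE_USER_STANDARD"

def mode_mismatch(cluster_json: dict, uc_enabled: bool) -> bool:
    """True when the cluster's mode does not match the workspace's UC state."""
    return cluster_json.get("data_security_mode") != expected_mode(uc_enabled)

cluster = {"cluster_name": "demo", "data_security_mode": "SINGLE_USER"}
assert mode_mismatch(cluster, uc_enabled=False)      # wrong for non-UC workspace
assert not mode_mismatch(cluster, uc_enabled=True)   # fine if UC is enabled
```

Running such a check over each cluster's JSON quickly surfaces the policy-forced `SINGLE_USER` cases described in step 2 and 3.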
+} \ No newline at end of file diff --git a/scraped_kb_articles/notebook-or-workflow-fails-with-error-py4jerror-could-not-find-py4j-jar-at-error-after-trying-to-install-pypmml-on-a-cluster.json b/scraped_kb_articles/notebook-or-workflow-fails-with-error-py4jerror-could-not-find-py4j-jar-at-error-after-trying-to-install-pypmml-on-a-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..eeb904ee49626ac35d5a02cab28fe195f168d5f8 --- /dev/null +++ b/scraped_kb_articles/notebook-or-workflow-fails-with-error-py4jerror-could-not-find-py4j-jar-at-error-after-trying-to-install-pypmml-on-a-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/notebook-or-workflow-fails-with-error-py4jerror-could-not-find-py4j-jar-at-error-after-trying-to-install-pypmml-on-a-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Note\nThis KB article is for Databricks Runtimes 11.3 LTS and above. If you use Databricks Runtime 10.4 LTS, refer to the\nPyPMML fails with Could not find py4j jar error\nKB article instead.\nProblem\nPyPMML is a Python PMML scoring library. You use the following code to install PyPMML on your cluster.\n%python\r\n\r\nfrom pypmml import Model\r\nmodelb = Model.fromFile('/dbfs//DecisionTreeIris.pmml')\nYour notebook or workflow subsequently fails with the following error.\nError : Py4JError: Could not find py4j jar at\nCause\nThe Py4J library that comes preinstalled with Databricks Runtime is installed in a different location than the Py4J package you are installing. As a result, when PyPMML attempts to invoke Py4J from the standard path, it fails.\nSolution\nInstall your Py4J library into the expected location.\n1. Check the Py4J version your cluster uses. Run the following code to identify the current Py4J version.\nimport py4j\r\nprint(py4j.__version__)\r\npy4j_version=py4j.__version__\n2. 
Install the PyPMML library along with the appropriate version of Py4J based on your Databricks Runtime version (obtained from step 1). Then restart the Python environment.\n!pip install pypmml\r\n!pip install py4j=={py4j_version}\r\ndbutils.library.restartPython()\n3. You are expected to encounter an error,\nValueError: invalid literal for int() with base 10: b'[Global flags]\\n'\n. Run the following code to avoid the error. Refer to the\nGetting ValueError when trying to import PMML files using PyPMML\nKB article for detailed information.\nimport os\r\ntmpval = os.environ.get(\"JAVA_OPTS\", \"\")\r\ntmpval = tmpval.replace(\"-XX:+PrintFlagsFinal\", \"\")\r\ntmpval = tmpval.replace(\"-verbose:gc\", \"\")\r\nos.environ[\"JAVA_OPTS\"] = tmpval\n4. Once the environment is set up, proceed to load your PMML model." +} \ No newline at end of file diff --git a/scraped_kb_articles/notebook-stopping-with-file-read-error-even-if-operation-or-command-is-still-executing.json b/scraped_kb_articles/notebook-stopping-with-file-read-error-even-if-operation-or-command-is-still-executing.json new file mode 100644 index 0000000000000000000000000000000000000000..2865c0cdbbf0f097a82598ba1ea6a714182c7a5d --- /dev/null +++ b/scraped_kb_articles/notebook-stopping-with-file-read-error-even-if-operation-or-command-is-still-executing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/notebook-stopping-with-file-read-error-even-if-operation-or-command-is-still-executing", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re working in a notebook on an interactive cluster that reads a file from Workspace files. 
When you execute a long-running operation or command in the notebook, you notice the command throws a file read error and stops your notebook.\nThis disrupts your workflows, results in incomplete processes, and requires restarting the operation.\nCause\nThe Workspace File System (WSFS) token for interactive sessions has a 36-hour timeout.\nSolution\nConsider using Databricks jobs for long-running operations.\nDatabricks jobs have a 30-day timeout, which is more suitable for extensive calculations. To create and run a Databricks job:\nNavigate to the Databricks workspace and select the\nJobs\ntab.\nClick on\n'Create Job'\nand configure the job settings, including the notebook to run and the cluster to use.\nSet the schedule and timeout settings to accommodate the long-running calculation.\nSave and run the job.\nFor detailed instructions, refer to the\nCreate and run Databricks Jobs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nNote\nAdditionally, we recommend coordinating with the engineering team to ensure the workspace token limitation is properly documented and to explore any potential configuration changes that could extend the token's validity period for interactive sessions." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/null-empty-strings.json b/scraped_kb_articles/null-empty-strings.json new file mode 100644 index 0000000000000000000000000000000000000000..64d0eadd942c40303ebaa0ed65e9b6f0a2672b21 --- /dev/null +++ b/scraped_kb_articles/null-empty-strings.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/null-empty-strings", + "title": "Unknown Article Title", + "content": "Problem\nIf you save data containing both empty strings and null values in a column on which the table is partitioned, both values become null after writing and reading the table.\nTo illustrate this, create a simple\nDataFrame\n:\n%scala\r\n\r\nimport org.apache.spark.sql.types._\r\nimport org.apache.spark.sql.catalyst.encoders.RowEncoder\r\nval data = Seq(Row(1, \"\"), Row(2, \"\"), Row(3, \"\"), Row(4, \"hello\"), Row(5, null))\r\nval schema = new StructType().add(\"a\", IntegerType).add(\"b\", StringType)\r\nval df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)\nAt this point, if you display the contents of\ndf\n, it appears unchanged:\nWrite\ndf\n, read it again, and display it. The empty strings are replaced by null values:\nCause\nThis is the expected behavior. It is inherited from Apache Hive.\nSolution\nIn general, you shouldn’t use both null and empty strings as values in a partitioned column." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/nullpointerexception-when-reading-shapefiles-from-cloud-storage-on-a-mosaic-and-gdal-enabled-cluster.json b/scraped_kb_articles/nullpointerexception-when-reading-shapefiles-from-cloud-storage-on-a-mosaic-and-gdal-enabled-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..f6ddab6bf8e51910ae2e07e0994b30ee620b2874 --- /dev/null +++ b/scraped_kb_articles/nullpointerexception-when-reading-shapefiles-from-cloud-storage-on-a-mosaic-and-gdal-enabled-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/nullpointerexception-when-reading-shapefiles-from-cloud-storage-on-a-mosaic-and-gdal-enabled-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to read a shapefile from a cloud storage bucket on a cluster with Mosaic and GDAL enabled, you get a\njava.lang.NullPointerException\nerror.\nCause\nThis issue occurs when trying to read geospatial files, such as\n.shp\nand\n.geojson\n, from any cloud-based object storage. It is not possible to use any cloud-based object storage directly with Mosaic GDAL APIs.\nSolution\nEnsure that the entire shapefile, including all necessary components (\n.shp\n,\n.shx\n,\n.dbf\n, etc.), is zipped and uploaded to a Unity Catalog volume or DBFS storage.\nFor more information, please review the\nMosaic + GDAL Shapefile Example\nfrom the Mosaic GitHub repository." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/object-ownership-is-getting-changed-on-dropping-and-recreating-tables.json b/scraped_kb_articles/object-ownership-is-getting-changed-on-dropping-and-recreating-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..71fc0fb07627e9f95ad7e75aa23f712d68b4cca1 --- /dev/null +++ b/scraped_kb_articles/object-ownership-is-getting-changed-on-dropping-and-recreating-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/object-ownership-is-getting-changed-on-dropping-and-recreating-tables", + "title": "Unknown Article Title", + "content": "Problem\nOwnership of SQL objects changes after dropping and recreating them. This can result in job failures due to permission issues.\nCause\nIn Databricks Runtime 7.3 LTS, when jobs are run with table ACLs turned off, any action that drops and recreates tables or views preserves the table ACLs that were set the last time the job was run with table ACLs turned on.\nIn Databricks Runtime 9.1 LTS and above this behavior changed. Any action that drops a table or view clears the table ACL state.\nSolution\nYou should use\nTRUNCATE\nor\nREPLACE\nfor tables and\nALTER VIEW\nfor views instead of dropping and recreating them.\nTo replace a view, you should be the owner of the view or an administrator.\nInfo\nIf you want to restore the behavior from Databricks Runtime 7.3 LTS, you can add\nspark.databricks.acl.enforceTableOwnerAssignment false\nto the cluster's\nSpark config\n.\nspark.databricks.acl.enforceTableOwnerAssignment\nwas introduced in Databricks Runtime 9.1 LTS.\nPreviously, when objects were created outside of a table ACL enabled cluster the ACL system had no knowledge of them. 
A Databricks administrator would have to set ownership permissions for new objects and clean up dangling permissions for deleted objects.\nNow, objects created outside of Databricks SQL or table ACL enabled clusters create representations in the ACL system, assigning ownership automatically or dropping permissions as needed." +} \ No newline at end of file diff --git a/scraped_kb_articles/office365-library-installation-causes-numpydtype-size-change-error-while-executing-notebook-commands.json b/scraped_kb_articles/office365-library-installation-causes-numpydtype-size-change-error-while-executing-notebook-commands.json new file mode 100644 index 0000000000000000000000000000000000000000..664e559e29d35ab4fd9a63fd10cbf132d4cffe64 --- /dev/null +++ b/scraped_kb_articles/office365-library-installation-causes-numpydtype-size-change-error-while-executing-notebook-commands.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/office365-library-installation-causes-numpydtype-size-change-error-while-executing-notebook-commands", + "title": "Unknown Article Title", + "content": "Problem\nAfter installing the Office365 library on your cluster, you attempt to run a notebook. You encounter an error preventing you from executing code or tasks within the notebook.\n“Failure starting repl. Try detaching and re-attaching the notebook. ValueError: numpy.dtype size changed, may indicate binary incompatibility\"\nCause\nWhen you install the Office365 library (currently, version 0.3.15) on your cluster, that library includes dependent libraries which are incompatible with the NumPy library default version on Databricks Runtime.\nSpecifically, the Office365 library relies on MoviePy, which uses NumPy. 
The latest version of MoviePy (currently, version: 2.1.1) upgrades the built-in NumPy version on the cluster, which creates an incompatibility issue in the environment.\nSolution\nPin the specific MoviePy version that uses the compatible NumPy version built into your Databricks Runtime version.\nFor more information, review the\nCluster libraries\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nImportant\nPinning a specific version of a library dependency may mean you do not receive any patches or new features. Pinning may also cause dependency issues based on the custom libraries you’re using.\nTo prevent similar issues in the future, Databricks recommends:\nRegularly reviewing and updating your library dependencies to ensure compatibility with the required NumPy version.\nMonitoring the Databricks release notes and documentation for updates and changes that may affect your libraries and dependencies.\nFor more information about pre-installed libraries on Databricks Runtime versions, refer to the\nDatabricks Runtime release notes versions and compatibility\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source.json b/scraped_kb_articles/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source.json new file mode 100644 index 0000000000000000000000000000000000000000..21d6ed037482af765177dfc302610afbda72eab0 --- /dev/null +++ b/scraped_kb_articles/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source", + "title": "Unknown Article Title", + "content": "Problem\nYou are using Apache Spark Structured Streaming to source data from a Kafka topic and write it to a Delta table sink, but challenges arise when attempting to reprocess data from the earliest offset in the topic. 
The stream is appropriately updated with the option\n\"startingOffsets\": \"earliest\"\nand restarted. However, the streaming query fails to consume data from the intended earlier offset position. The issue persists, even after you have restarted the Databricks cluster associated with the streaming query.\nCause\nIf the option\n\"startingOffsets\": \"earliest\"\nis set, and upon restarting the streaming query it still fails to consume from the earliest offset position, it may be due to previously running the query without this option. Specifically, if a streaming query is started without the\n\"startingOffsets\"\noption, and has already completed at least one micro-batch, the\n“startingOffsets”\noption may not take effect as intended.\nFor Structured Streaming queries with a Kafka source, the\n\"startingOffsets\"\noption is only applicable when initiating a new streaming query. Structured Streaming manages offset consumption internally, through its checkpoint location, independent of the Kafka consumer. As a result, resuming or restarting a query uses the checkpoint location and disregards the\n\"startingOffsets\"\noption.\nSolution\nTo reprocess data from the beginning, you must use a new checkpoint directory for the streaming job. By utilizing a distinct checkpoint directory, you ensure that the streaming engine uses the specified\n\"startingOffsets\": \"earliest\"\noption when initializing the streaming query, allowing it to consume data from the earliest offset position as intended." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/oidc-single-sign-on-authentication-error-during-login.json b/scraped_kb_articles/oidc-single-sign-on-authentication-error-during-login.json new file mode 100644 index 0000000000000000000000000000000000000000..a695ae2dbd802b5dbdc70b5e6ca4877184c4c904 --- /dev/null +++ b/scraped_kb_articles/oidc-single-sign-on-authentication-error-during-login.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/oidc-single-sign-on-authentication-error-during-login", + "title": "Unknown Article Title", + "content": "Problem\nWhen you attempt to log in via Single Sign-On (SSO) on Databricks, you encounter the following error on the login page.\nOIDC Single Sign-on Authentication Error.\nCause\nThe identity provider (IDP) is returning a non-200 HTTP response during the OIDC authentication process.\nSolution\nGenerate a HAR File. Follow the steps in the\nGenerate Browser HAR Files\nKB article.\nExamine the HAR File.\nSearch for requests to the following URL.\nhttps://accounts.cloud.databricks.com/oidc/consume\nLook for the corresponding response, which typically includes a redirect to either a successful login or an error message, such as:\nhttps://accounts.cloud.databricks.com/login?error=\nCheck the error details for an\noidc_code_exchange_failure\nmessage.\nUpdate or reissue your client secret.\nIf you do not see the\noidc_code_exchange_failure\nmessage, the SSO error is related to something else. Contact Databricks Support for assistance in further diagnosing the error." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/onehotencoderestimator-error.json b/scraped_kb_articles/onehotencoderestimator-error.json new file mode 100644 index 0000000000000000000000000000000000000000..4f068391d9d8d40ece89428e70eb5327d141fb08 --- /dev/null +++ b/scraped_kb_articles/onehotencoderestimator-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/onehotencoderestimator-error", + "title": "Unknown Article Title", + "content": "Problem\nYou have migrated a notebook from Databricks Runtime 6.4 for Machine Learning or below to Databricks Runtime 7.3 for Machine Learning or above.\nYou are attempting to import\nOneHotEncoderEstimator\nand you get an import error.\nImportError: cannot import name 'OneHotEncoderEstimator' from 'pyspark.ml.feature' (/databricks/spark/python/pyspark/ml/feature.py)\nCause\nOneHotEncoderEstimator\nwas renamed to\nOneHotEncoder\nin Apache Spark 3.0.\nSolution\nYou must replace\nOneHotEncoderEstimator\nreferences in your notebook with\nOneHotEncoder\n.\nFor example, the following sample code returns an import error in Databricks Runtime 7.3 for Machine Learning or above:\n%python\r\n\r\nfrom pyspark.ml.feature import OneHotEncoderEstimator\nThe following sample code functions correctly in Databricks Runtime 7.3 for Machine Learning or above:\n%python\r\n\r\nfrom pyspark.ml.feature import OneHotEncoder" +} \ No newline at end of file diff --git a/scraped_kb_articles/openssl-ssl_connect-ssl_error_syscall-error.json b/scraped_kb_articles/openssl-ssl_connect-ssl_error_syscall-error.json new file mode 100644 index 0000000000000000000000000000000000000000..89b64548de2536c184d74bfe4170485290b49fb2 --- /dev/null +++ b/scraped_kb_articles/openssl-ssl_connect-ssl_error_syscall-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/openssl-ssl_connect-ssl_error_syscall-error", + "title": "Unknown Article Title", + "content": "Problem\nYou are 
trying to install third-party libraries via an init script. The init script attempts to download the libraries using curl or wget, but the download fails with an SSL error message.\ncurl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to :443\nCause\nThe\nOpenSSL SSL_connect: SSL_ERROR_SYSCALL\nerror means that your cluster does not have the required SSL certificates to validate the connection to the host.\nSolution\nYou need to install the host's SSL certificate on your cluster.\nWhen the\nOpenSSL SSL_connect: SSL_ERROR_SYSCALL\nerror occurs you can use\ncurl\nto debug the issue.\n%sh \"curl -v \"\nThis option gives you debug output while connecting to the target server. You can check the handshake failure while connecting as well.\nExample result\n* Trying REDACTED...\r\n* TCP_NODELAY set\r\n* Connected to hostname (REDACTED) port 443 (#0)\r\n* ALPN, offering h2\r\n* ALPN, offering http/1.1\r\n* successfully set certificate verify locations:\r\n* CAfile: /etc/ssl/certs/ca-certificates.crt\r\n CApath: /etc/ssl/certs\r\n} [5 bytes data]\r\n* TLSv1.3 (OUT), TLS handshake, Client hello (1): <---- 1st step in TLS handshake, we do not get to the 2nd\r\n} [512 bytes data]\r\n* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in :443\r\n* Closing connection 0\r\ncurl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to :443\nGet the certificates used by the target server\nYou can use Google Chrome to export the SSL certificate of the target website.\nClick the Secure button (a padlock) in an address bar.\nClick\nCertificate (Valid)\n.\nClick\nDetails\n.\nClick\nCopy to File…\n.\nClick\nNext\n.\nSelect\nBase-64 encoded X.509 (.CER)\n.\nClick\nNext\n.\nSave the SSL certificate file as\nlocal-ca.crt\n.\nClick\nNext\n.\nClick\nFinish\n.\nUpload certificate files to your workspace\nSave the certificates as workspace files.\nReview the\nWorkspace files basic usage\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\nCreate an init script to install the 
certificates\nCreate a new file in your workspace called\ncertificates.sh\nin the\n/Shared/\nfolder.\nClick\nWorkspace\nin the left side menu.\nClick\nWorkspace\n.\nRight-click\nShared\n.\nClick\nCreate\n.\nClick\nFile\n.\nEnter certificates.sh in the text entry field and click\nCreate File\n.\nCopy-and-paste this init script into the file you just created.\n\nshould be replaced with the path to where you saved the\nlocal-ca.crt\nfile in your workspace.\n#!/bin/bash\r\nsudo apt-get install -y ca-certificates\r\nsudo cp /local-ca.crt /usr/local/share/ca-certificates\r\nsudo update-ca-certificates\nVerify init script permissions\nCheck the permissions for\ncertificates.sh\nand make sure all users who are creating clusters have the\ncan_run\nand\ncan_read\npermissions. The creator of the file and administrators have all permissions by default.\nConfigure a cluster-scoped init script\nFollow the\nUse cluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to configure\ncertificates.sh\nas a cluster-scoped init script.\nInfo\nThe\ncertificates.sh\ninit script should run before the init script that generated the\nOpenSSL SSL_connect: SSL_ERROR_SYSCALL\nerror message.\nRestart the cluster\nRestart the cluster and verify that your init scripts successfully complete.\nFor more information on the certificate store, review the Ubuntu\nCA trust store\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/optimize-delta-sink-structured-streaming.json b/scraped_kb_articles/optimize-delta-sink-structured-streaming.json new file mode 100644 index 0000000000000000000000000000000000000000..ece1a6939965268adf21a6e3a2ab9615444cada7 --- /dev/null +++ b/scraped_kb_articles/optimize-delta-sink-structured-streaming.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/optimize-delta-sink-structured-streaming", + "title": "Unknown Article Title", + "content": "You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster.\nIf your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch.\nThe\nautoOptimize\noperation compacts to 128 MB files. An explicit optimize operation compacts Delta Lake files to 1 GB files.\nIf you do not have a sufficient number of eligible files in each microbatch, you should optimize the Delta table files periodically.\nUse\nforeachBatch\nwith a mod value\nOne of the easiest ways to periodically optimize the Delta table sink in a structured streaming application is by using\nforeachBatch\nwith a mod value on the microbatch\nbatchId\n.\nAssume that you have a streaming DataFrame that was created from a Delta table. 
You use\nforeachBatch\nwhen writing the streaming DataFrame to the Delta sink.\nWithin\nforeachBatch\n, the mod value of\nbatchId\nis used so the\noptimize\noperation is run after every 10 microbatches, and the\nzorder\noperation is run after every 101 microbatches.\n%scala\r\n\r\nval df = spark.readStream.format(\"delta\").table(\"\")\r\ndf.writeStream.format(\"delta\")\r\n  .foreachBatch{ (batchDF: DataFrame, batchId: Long) =>\r\n    batchDF.persist()\r\n    if(batchId % 10 == 0){spark.sql(\"optimize \")}\r\n    if(batchId % 101 == 0){spark.sql(\"optimize zorder by ()\")}\r\n    batchDF.write.format(\"delta\").mode(\"append\").saveAsTable(\"\")\r\n  }.outputMode(\"update\")\r\n  .start()\nYou can modify the mod value as appropriate for your structured streaming application." +} \ No newline at end of file diff --git a/scraped_kb_articles/optimize-is-only-supported-for-delta-tables-error-on-delta-lake.json b/scraped_kb_articles/optimize-is-only-supported-for-delta-tables-error-on-delta-lake.json new file mode 100644 index 0000000000000000000000000000000000000000..254c4f7c8dc703197e2528d5606939dbd1fb3837 --- /dev/null +++ b/scraped_kb_articles/optimize-is-only-supported-for-delta-tables-error-on-delta-lake.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/optimize-is-only-supported-for-delta-tables-error-on-delta-lake", + "title": "Unknown Article Title", + "content": "Problem\nYou run\nOPTIMIZE\non a Delta table and get an error message saying it is only supported on Delta tables.\nError: ``.`` is not a Delta table. 
OPTIMIZE is only supported for Delta tables.\nCause\nThis can happen if the target table's storage location was modified and the table was recreated with a new storage location before you tried to run\nOPTIMIZE\n.\nIf you review the driver logs, you see that there is no Delta log for the table at the old location.\nINFO DeltaLog: No delta log found for the Delta table at \nThis means the metadata is still pointing to the old table location. It has not been updated with the new (current) table location.\nSolution\nEnsure the Delta table is recreated in the new location using\nCREATE OR REPLACE TABLE\n(\nAWS\n|\nAzure\n|\nGCP\n). This replaces the Delta table.\nAfter the Delta table has been moved, run\nFSCK REPAIR TABLE\n(\nAWS\n|\nAzure\n|\nGCP\n).\nFSCK REPAIR TABLE ``.``\nRun\nOPTIMIZE\nto optimize the Delta table. It should complete successfully.\nOPTIMIZE ```" +} \ No newline at end of file diff --git a/scraped_kb_articles/optimize-streaming-transactions-with-trigger.json b/scraped_kb_articles/optimize-streaming-transactions-with-trigger.json new file mode 100644 index 0000000000000000000000000000000000000000..63df8d6b7197d4ffb59792345dd78c8a05520d50 --- /dev/null +++ b/scraped_kb_articles/optimize-streaming-transactions-with-trigger.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/optimize-streaming-transactions-with-trigger", + "title": "Unknown Article Title", + "content": "When running a structured streaming application that uses cloud storage buckets (S3, ADLS Gen2, etc.) it is easy to incur excessive transactions as you access the storage bucket.\nFailing to specify a\n.trigger\noption in your streaming code is one common reason for a high number of storage transactions. When a\n.trigger\noption is not specified, the storage can be polled frequently. 
This happens immediately after the completion of each micro-batch by default.\nThe default behavior is described in the official\nApache Spark documentation on triggers\nas, \"If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.\"\nThis sample code does not have a\n.trigger\noption defined. If run, it would result in excessive storage transactions.\n%python\r\n\r\nspark.readStream.format(\"delta\").load(\"\")\r\n.writeStream\r\n.format(\"delta\")\r\n.outputMode(\"append\")\r\n.option(\"checkpointLocation\",\"\")\r\n.options(**writeConfig)\r\n.start()\nYou can reduce the number of storage transactions by setting the .\ntrigger\noption in the\n.writeStream\n. Setting\n.trigger\nprocessing time to a few seconds prevents short polling.\nInstructions\nThe default behavior is to check the source for updates every 10 ms. For most users, a longer interval between source updates will have no noticeable effect on performance, but the transaction costs are greatly reduced.\nFor example, let's use a processing time of 5 seconds. That is 500 times slower than 10 ms. The storage calls are reduced accordingly.\nSetting a processing time of 5 seconds requires adding\n.trigger(processingTime='5 seconds')\nto the\n.writeStream\n.\nFor example, modifying our existing sample code to include a\n.trigger\nprocessing time of 5 seconds only requires the addition of one line.\n%python\r\n\r\nspark.readStream.format(\"delta\").load(\"\")\r\n.writeStream\r\n.format(\"delta\")\r\n.trigger(processingTime='5 seconds') #Added line of code that defines .trigger processing time.\r\n.outputMode(\"append\")\r\n.option(\"checkpointLocation\",\"\")\r\n.options(**writeConfig)\r\n.start()\nYou should experiment with the\n.trigger\nprocessing time to determine a value that is optimized for your application." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/oracle-federation-failing-to-find-a-data-source.json b/scraped_kb_articles/oracle-federation-failing-to-find-a-data-source.json new file mode 100644 index 0000000000000000000000000000000000000000..36c56208915768b53d0bd073a31a9cf9d2ccae21 --- /dev/null +++ b/scraped_kb_articles/oracle-federation-failing-to-find-a-data-source.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/oracle-federation-failing-to-find-a-data-source", + "title": "Unknown Article Title", + "content": "Problem\nWhen exploring a federated catalog in the UI connected to an Oracle database, you receive an error message.\nFailed to find the data source: UNKNOWN_CONNECTION_TYPE\nCause\nYou are using an unsupported version of Databricks Runtime to access the table or catalog.\nAdditionally, network connectivity issues between the Databricks compute resource and Oracle can contribute to the issue.\nSolution\nIf you are using all-purpose compute, ensure that you are using Databricks Runtime 16.1 or above. For more information, refer to the\nRun federated queries on Oracle\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you are using SQL Pro Warehouse, ensure it is running version 2024.50 or above.\nTo ensure network connectivity between the Databricks compute resource and Oracle, run the following command in a notebook.\n%sh\r\nnc -vz  \nIf network connectivity is failing, follow steps in the\nNetworking recommendations for Lakehouse Federation\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to establish the connection." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/oracle-jdbc-connection-fails-when-using-keystore-and-truststore-wallets-in-a-standard-compute.json b/scraped_kb_articles/oracle-jdbc-connection-fails-when-using-keystore-and-truststore-wallets-in-a-standard-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..8203f0c019894ccf53261207102f0888a28f91df --- /dev/null +++ b/scraped_kb_articles/oracle-jdbc-connection-fails-when-using-keystore-and-truststore-wallets-in-a-standard-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/oracle-jdbc-connection-fails-when-using-keystore-and-truststore-wallets-in-a-standard-compute", + "title": "Unknown Article Title", + "content": "Problem\nYou’re trying to connect to your Oracle database from a standard compute using a JDBC connection. The JDBC connection uses the\njavax.net.trustStore\nor the\njavax.net.keyStore\nproperties that point to the\nwallet.sso\nfiles from a volume or storage path.\nThe connection attempt fails with a\n“handshake_failure”\nerror.\nCause\nOracle can’t access the\nwallet.sso\nfiles.\nSolution\nUse an init script to make the files available to the compute.\n1. Add the following init script to the volume or storage path to copy the\nwallet.sso\nfiles to the\n/cert\nlocation.\n#!/bin/bash\r\nmkdir /cert\r\ncp /cert\r\ngroupadd spark-users\r\nchgrp -R spark-users /cert\r\nchmod 440 /cert/cwallet_dev.sso\n2. Add the init script to an allowlist. For details, refer to the\nAllowlist libraries and init scripts on compute with standard access mode (formerly shared access mode)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n3. Attach the allowlisted init script to the standard compute. For details, refer to the “Configure a cluster-scoped init script using the UI” section of the\nCluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n4. 
Add the Apache Spark configuration\nspark.connect.preserveOptionCasing\ntrue to the compute. For details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n5. In your Oracle JDBC connection, use the\n/cert/wallet.sso\npath." +} \ No newline at end of file diff --git a/scraped_kb_articles/orc-tables-not-recognized-when-processed-in-serverless-warehouses.json b/scraped_kb_articles/orc-tables-not-recognized-when-processed-in-serverless-warehouses.json new file mode 100644 index 0000000000000000000000000000000000000000..911ce0af8de93636c3a5d6f8abd3a0718581e5f8 --- /dev/null +++ b/scraped_kb_articles/orc-tables-not-recognized-when-processed-in-serverless-warehouses.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/orc-tables-not-recognized-when-processed-in-serverless-warehouses", + "title": "Unknown Article Title", + "content": "Problem\nWhen you create a table in ORC format, either in Hive or with Hive’s\nCREATE\nsyntax in Databricks, the table is not recognized as an ORC table when processed in serverless warehouses. You receive the following error.\n\"Can't insert into the target. Can only write data to relations with a single path but given paths are []. SQLSTATE: 42809\"\nHowever, the\nINSERT\ncommand works on all-purpose clusters.  
This may block you from migrating to a serverless warehouse.\nCause\nINSERT\nis not supported for Hive tables with any column that has a\ntimestamp-millis\ndata type in serverless warehouses, and Hive tables often use that data type.\nSolution\nTo enable the ORC table to work in serverless warehouses, recreate the ORC table as an Apache Spark format ORC table with\n\"USING ORC\"\ninstead of\n\"STORED AS ORC\"\nin the\nCREATE TABLE\nstatement.\nExample\nCREATE TABLE hive_metastore.hms_schema.orc_table (\r\ncolumn1 STRING,\r\ncolumn2 INT,\r\ncolumn3 TIMESTAMP\r\n)\r\nUSING ORC\r\nPARTITIONED BY (partition_column STRING, dynamic_partition STRING);\nPreventative measures\nDatabricks recommends converting all existing Hive tables to Spark tables to avoid similar issues in the future. For more information, refer to the\nINSERT\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/other-users-can-see-root-main-folder-despite-not-having-access-permissions.json b/scraped_kb_articles/other-users-can-see-root-main-folder-despite-not-having-access-permissions.json new file mode 100644 index 0000000000000000000000000000000000000000..e766befe1df338db1fab15b777d8502d7cc65348 --- /dev/null +++ b/scraped_kb_articles/other-users-can-see-root-main-folder-despite-not-having-access-permissions.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/other-users-can-see-root-main-folder-despite-not-having-access-permissions", + "title": "Unknown Article Title", + "content": "Problem\nYou notice your workspace root (main) folder is visible to other users in the workspace, even though the folder permissions are set to you and the workspace admin.\nCause\nUnder workspace/Users, if a user has permission to view a nested object within a folder, they will also have visibility of the folder.\nWhen you share a file or object inside your root folder and then move it to the Trash, its permissions remain intact for 30 days. 
During this period, a user with access to the object in the Trash retains the ability to view the parent folder, even though they cannot access its contents.\nSolution\nNavigate to the Trash folder and permanently delete all objects in the Trash. This immediately removes any residual permissions tied to those objects.\nBeyond manual Trash cleaning, the Daily Task Service permanently deletes objects in the Trash after 30 days. Once the objects are deleted, unauthorized users will no longer see the main folder.\nDatabricks recommends avoiding sharing files directly from sensitive folders. Instead, create separate folders for shared files to better control visibility and permissions." +} \ No newline at end of file diff --git a/scraped_kb_articles/overlapping-paths-error-when-querying-both-hive-and-unity-catalog-tables.json b/scraped_kb_articles/overlapping-paths-error-when-querying-both-hive-and-unity-catalog-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..c93c4545d43bc89ff9aafac3333112b95cd8c82e --- /dev/null +++ b/scraped_kb_articles/overlapping-paths-error-when-querying-both-hive-and-unity-catalog-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/overlapping-paths-error-when-querying-both-hive-and-unity-catalog-tables", + "title": "Unknown Article Title", + "content": "Problem\nYou are running queries when you get an\noverlapping paths\nerror message.\norg.apache.spark.SparkException: Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported.\nCause\nAn\noverlapping paths\nerror happens when a single cell in a notebook queries both an Apache Hive table and a Unity Catalog table that both refer to the same external storage path.\nSolution\nSplit your Hive table queries and Unity Catalog table queries into different cells.\nInfo\nYou can easily tell the difference between the two table types by the namespace 
notation used.\nHive tables use a two-level namespace notation\n.\n, while Unity Catalog tables use a three-level namespace notation\n..\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/overwrite-log4j-logs.json b/scraped_kb_articles/overwrite-log4j-logs.json new file mode 100644 index 0000000000000000000000000000000000000000..f740320971acdc30dea606786b8462c380a3885c --- /dev/null +++ b/scraped_kb_articles/overwrite-log4j-logs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/overwrite-log4j-logs", + "title": "Unknown Article Title", + "content": "Warning\nThis article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (\nCVE-2021-4104\n,\nCVE-2020-9488\n, and\nCVE-2019-17571\n). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities. You should not enable either of these classes in your cluster.\nThere is no standard way to overwrite\nlog4j\nconfigurations on clusters with custom configurations. 
You must overwrite the configuration files using init scripts.\nThe current configurations are stored in two\nlog4j.properties\nfiles:\nOn the driver:\n%sh\r\n\r\ncat /home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties\nOn the worker:\n%sh\r\n\r\ncat /home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties\nTo set class-specific logging on the driver or on workers, use the following script:\n%sh\r\n\r\n#!/bin/bash\r\necho \"Executing on Driver: $DB_IS_DRIVER\"\r\nif [[ $DB_IS_DRIVER = \"TRUE\" ]]; then\r\nLOG4J_PATH=\"/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties\"\r\nelse\r\nLOG4J_PATH=\"/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties\"\r\nfi\r\necho \"Adjusting log4j.properties here: ${LOG4J_PATH}\"\r\necho \"log4j.=\" >> ${LOG4J_PATH}\nReplace\n\nwith the property name, and\n\nwith the property value.\nUpload the script to DBFS and select a cluster using the cluster configuration UI.\nYou can also set\nlog4j.properties\nfor the driver in the same way.\nSee Cluster node initialization scripts (\nAWS\n|\nAzure\n|\nGCP\n) for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/parallelize-fs-operations.json b/scraped_kb_articles/parallelize-fs-operations.json new file mode 100644 index 0000000000000000000000000000000000000000..c63e60c5b4bbecec67617148676e10105f7cd68a --- /dev/null +++ b/scraped_kb_articles/parallelize-fs-operations.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/parallelize-fs-operations", + "title": "Título do Artigo Desconhecido", + "content": "When you need to speed up copy and move operations, parallelizing them is usually a good option. You can use Apache Spark to parallelize operations on executors. 
On Databricks you can use\nDBUtils\nAPIs; however, these API calls are meant for use on driver nodes and shouldn’t be used in Spark jobs running on executors.\nIn this article, we are going to show you how to use the Apache Hadoop\nFileUtil\nfunction along with DBUtils to parallelize a Spark copy operation.\nYou can use this example as a basis for other filesystem operations.\nNote\nThe example copy operation may look familiar as we are using DBUtils and Hadoop FileUtil to emulate the functions of the Hadoop\nDistCp\ntool.\nImport required libraries\nImport the Hadoop functions and define your source and destination locations.\n%scala\r\n\r\nimport org.apache.hadoop.fs._\r\n\r\nval source = \"\"\r\nval dest = \"\"\r\n\r\ndbutils.fs.mkdirs(dest)\nBroadcast information from the driver to executors\n%scala\r\n\r\nval conf = new org.apache.spark.util.SerializableConfiguration(sc.hadoopConfiguration)\r\nval broadcastConf = sc.broadcast(conf)\r\nval broadcastDest = sc.broadcast(dest)\nCopy paths to a sequence\n%scala\r\n\r\nval filesToCopy = dbutils.fs.ls(source).map(_.path)\nParallelize the sequence and divide the workload\nHere we first get the Hadoop configuration and destination path. 
Then we create the path objects, before finally executing the\nFileUtil.copy\ncommand.\n%scala\r\n\r\nspark.sparkContext.parallelize(filesToCopy).foreachPartition { rows =>\r\n  rows.foreach { file =>\r\n\r\n\r\n    val conf = broadcastConf.value.value\r\n    val destPathBroadcasted = broadcastDest.value\r\n\r\n\r\n    val fromPath = new Path(file)\r\n    val toPath = new Path(destPathBroadcasted)\r\n    val fromFs = fromPath.getFileSystem(conf)\r\n    val toFs = toPath.getFileSystem(conf)\r\n\r\n\r\n    FileUtil.copy(fromFs, fromPath, toFs, toPath, false, conf)\r\n  }\r\n}" +} \ No newline at end of file diff --git a/scraped_kb_articles/parameter-workload_size-always-executing-small-when-using-the-databricks-agents-library-to-update-existing-model-serving-endpoints.json b/scraped_kb_articles/parameter-workload_size-always-executing-small-when-using-the-databricks-agents-library-to-update-existing-model-serving-endpoints.json new file mode 100644 index 0000000000000000000000000000000000000000..a244f4959fccf2cd08e86459ded3f17362637375 --- /dev/null +++ b/scraped_kb_articles/parameter-workload_size-always-executing-small-when-using-the-databricks-agents-library-to-update-existing-model-serving-endpoints.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/parameter-workload_size-always-executing-small-when-using-the-databricks-agents-library-to-update-existing-model-serving-endpoints", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the databricks-agents library to update an existing model serving endpoint, calling the\nagents.deploy()\nmethod with the\nworkload_size\nparameter set to any value other than\nSMALL\nstill results in deployment with workload size\nSMALL\n.\nExample\nThe following code snippet intends to redeploy a new model on the existing serving endpoint with the\nMEDIUM\nsize. 
However, due to this issue, the endpoint incorrectly deploys using the default size\nSMALL\n.\nagents.deploy(\r\n    ,\r\n    ,\r\n    workload_size=ServedModelInputWorkloadSize.MEDIUM\r\n)\nCause\nThe specified\nworkload_size\nparameter is not correctly applied when deploying a new agent version to the existing model-serving endpoint.\nSolution\nUpdate the databricks-agents library to version 0.17.0 or later. This version resolves the issue, ensuring the\nworkload_size\nparameter is correctly respected during deployments on existing model serving endpoints." +} \ No newline at end of file diff --git a/scraped_kb_articles/parquet-table-counts-not-being-reflected-based-on-concurrent-updates.json b/scraped_kb_articles/parquet-table-counts-not-being-reflected-based-on-concurrent-updates.json new file mode 100644 index 0000000000000000000000000000000000000000..af8395720683cf3ea0adf1275533250bab9ac16b --- /dev/null +++ b/scraped_kb_articles/parquet-table-counts-not-being-reflected-based-on-concurrent-updates.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/parquet-table-counts-not-being-reflected-based-on-concurrent-updates", + "title": "Unknown Article Title", + "content": "Problem\nYou may notice that a Parquet table count within a notebook remains the same even after additional rows are added to the table from an external process.\nFor instance, if a count is taken from a table (Table 1) in a notebook (Notebook A) and the count is 100, an outside process or another notebook updates Table 1 and adds 100 additional rows. However, if the count is taken again in Notebook A, it continues to show 100 rows.\nCause\nEach notebook has a different Apache Spark session, different instances of the Hive client, and different caches, even if two notebooks are attached to the same cluster. 
When Hive metastore Parquet table conversion is enabled, metadata in those converted tables is also cached.\nSolution\nManually refresh the table in the notebook where the count was initially taken. This can be done by executing a REFRESH TABLE command. For more information on metadata refreshing with Spark, please review the\nSpark SQL, DataFrames and Datasets Guide\n.\nAlternatively, you can use Delta tables instead of Parquet tables to avoid needing to do a manual refresh. For more information on using Delta tables, please review\nWhat is Delta Lake?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/parquet-table-last-modification-retrieval-returns-null.json b/scraped_kb_articles/parquet-table-last-modification-retrieval-returns-null.json new file mode 100644 index 0000000000000000000000000000000000000000..f27e22c39f1be12bf36f4910e2f1a6a4943c3b50 --- /dev/null +++ b/scraped_kb_articles/parquet-table-last-modification-retrieval-returns-null.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/parquet-table-last-modification-retrieval-returns-null", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to use the\nDESCRIBE DETAIL\ncommand to retrieve the last modification data of a Parquet table, it returns\nNULL\n.\nCause\nParquet tables do not have metadata information, such as\n_delta_log\n, which is present in Delta tables and is the source used by\nDESCRIBE DETAIL\n.\nSolution\nAudit columns, add an additional column, or list files and sort to get a last modification date.\nIf your data contains audit columns like\nlastModified\nor\nlastUpdated\n, you can use a query to apply the\nmax()\nfunction on that column to find the last modified date.\nAlternatively, you can add an additional column to your Parquet table that populates using\ncurrent_timestamp()\nwhile inserting or updating records. 
Then, you can apply a query with the\nmax()\nfunction to get the value.\nList the files within the Parquet table path and sort them by the\nmodificationTime\ncolumn to find when the last file was added or modified." +} \ No newline at end of file diff --git a/scraped_kb_articles/parquet-timestamp-requires-msver12.json b/scraped_kb_articles/parquet-timestamp-requires-msver12.json new file mode 100644 index 0000000000000000000000000000000000000000..0d90a9cba32baee60e6554afb7832d7fb987d77a --- /dev/null +++ b/scraped_kb_articles/parquet-timestamp-requires-msver12.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/parquet-timestamp-requires-msver12", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to create a Parquet table using\nTIMESTAMP\n, but you get an error message.\nError in SQL statement: QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support timestamp. 
See HIVE-6384\nExample code\n%sql\r\n\r\nCREATE EXTERNAL TABLE IF NOT EXISTS testTable (\r\n  emp_name STRING,\r\n  joining_datetime TIMESTAMP\r\n)\r\nPARTITIONED BY\r\n  (date DATE)\r\nSTORED AS\r\n  PARQUET\r\nLOCATION\r\n  \"/mnt//emp.testTable\"\nCause\nParquet requires a Hive metastore version of 1.2 or above in order to use\nTIMESTAMP\n.\nInfo\nThe default Hive metastore client version used in Databricks Runtime is 0.13.0.\nSolution\nYou must upgrade the Hive metastore client on the cluster.\nYou can do this by adding the following settings to the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nDatabricks Runtime 6.6 and below\nspark.sql.hive.metastore.version 1.2.1\r\nspark.sql.hive.metastore.jars builtin\nDatabricks Runtime 7.0 and above\nspark.sql.hive.metastore.jars /dbfs \r\nspark.sql.hive.metastore.version 1.2.1\nInfo\nFor Databricks Runtime 7.0 and above you must download the metastore jars and point to them (\nAWS\n|\nAzure\n|\nGCP\n) as detailed in the Databricks documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/parquet-to-delta-fails.json b/scraped_kb_articles/parquet-to-delta-fails.json new file mode 100644 index 0000000000000000000000000000000000000000..f9bc9b076642178755b4c7d85ec48e1e2e611cb4 --- /dev/null +++ b/scraped_kb_articles/parquet-to-delta-fails.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/parquet-to-delta-fails", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to convert a Parquet file to a Delta Lake file.\nThe directory containing the Parquet file contains one or more subdirectories.\nThe conversion fails with the error message:\nExpecting 0 partition column(s): [], but found 1 partition column(s): [] from parsing the file name: ;.\nCause\nThe conversion process is attempting to process the subdirectory as a partition. 
This causes the error message.\nSolution\nIf you are using Databricks Runtime 7.5 or below, ensure that directories containing Parquet files do not have subdirectories.\nThis issue is resolved in Databricks Runtime 8.0 and above." +} \ No newline at end of file diff --git a/scraped_kb_articles/parse_syntax_error-error-when-using-odbc-driver-and-ms-polybase-to-execute-queries-in-databricks.json b/scraped_kb_articles/parse_syntax_error-error-when-using-odbc-driver-and-ms-polybase-to-execute-queries-in-databricks.json new file mode 100644 index 0000000000000000000000000000000000000000..3d6e5b7a502b0a8e443c48321831aa08e2cdbad6 --- /dev/null +++ b/scraped_kb_articles/parse_syntax_error-error-when-using-odbc-driver-and-ms-polybase-to-execute-queries-in-databricks.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/parse_syntax_error-error-when-using-odbc-driver-and-ms-polybase-to-execute-queries-in-databricks", + "title": "Unknown Article Title", + "content": "Problem\nYou install PolyBase and the Databricks Simba ODBC driver in your Azure environment to allow Azure Synapse Analytics connectivity, and then create a query with this setup in your Databricks environment. When you try to execute the query, you encounter a\nPARSE_SYNTAX_ERROR\nmessage.\nCause\nPolyBase handles the conversion of queries from MS SQL Server syntax to Apache Spark SQL syntax. Issues during the conversion process can lead to parsing errors.\nExample\nIn the following example, you provide SQL query 1. The conversion process results in SQL query 2, which is invalid in Spark SQL syntax.\nThe issue is located in the last two lines. 
The segment\nON (`t0`.`fieldB` = `T_3`.`fieldB`)) `t4`\nis misplaced and should have been included after the first join.\nSQL query 1\nSELECT COUNT_big(table0.fieldA)\r\nFROM     schema.table0 t0\r\nINNER JOIN schema.table1 t1 ON t0.fieldB  = t1.fieldB\r\nINNER JOIN schema.table2 t2 ON t0.fieldC  = t2.fieldC\r\nWHERE     t2.fieldD >= 20241101\nSQL query 2\nSELECT `fieldA` `fieldA`\r\nFROM (SELECT (COUNT(0)) `fieldA`\r\nFROM `catalog`.`schema`.`table0` `t0`\r\nINNER JOIN `catalog`.`schema`.`table1` `t1`\r\nINNER JOIN (SELECT `t2`.`fieldB`, `t2`.`fieldC`\r\nFROM `schema`.`catalog`.`table` `t2`\r\nWHERE (`t2`.`fieldD` >= (20241001))) `t3`\r\nON (`t3`.`fieldA` = `T_1`.`fieldA`)\r\nON (`t0`.`fieldB` = `T_3`.`fieldB`)) `t4`\nSolution\nThe error stems from the Microsoft side and requires their support team to assist.\nContact the Microsoft technical support team for help retrieving the ODBC driver logs from your Azure environment.\nLocate the PARSE_SYNTAX_ERROR in the logs. The converted query appears next to the error message.\nReport the issue to Microsoft technical support to fix the problem. Send your original query together with the error message that contains the invalid generated query." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/parsing-post-meridiem-time-pm-with-to_timestamp-returns-null.json b/scraped_kb_articles/parsing-post-meridiem-time-pm-with-to_timestamp-returns-null.json new file mode 100644 index 0000000000000000000000000000000000000000..fd54e7cf6d8472a9a3b25ba89ff84e36fd6f0fad --- /dev/null +++ b/scraped_kb_articles/parsing-post-meridiem-time-pm-with-to_timestamp-returns-null.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/parsing-post-meridiem-time-pm-with-to_timestamp-returns-null", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to parse a 12-hour (AM/PM) time value with\nto_timestamp()\n, but instead of returning a 24-hour time value it returns null.\nFor example, this sample code:\n%sql\r\n\r\nSELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd HH:mm:ss a');\nReturns null when run:\nCause\nto_timestamp()\nrequires the hour format to be in lowercase.\nIf the hour format is in capital letters,\nto_timestamp()\nreturns null.\nSolution\nMake sure you specify the hour format in lowercase letters.\nFor example, this sample code:\n%sql\r\n\r\nSELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd hh:mm:ss a');\nReturns the time as a 24-hour time value." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/pass-arguments-to-a-notebook-as-a-list.json b/scraped_kb_articles/pass-arguments-to-a-notebook-as-a-list.json new file mode 100644 index 0000000000000000000000000000000000000000..bdfa91c8f6b0923c02d0d0b88968949b4cd4631d --- /dev/null +++ b/scraped_kb_articles/pass-arguments-to-a-notebook-as-a-list.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/pass-arguments-to-a-notebook-as-a-list", + "title": "Título do Artigo Desconhecido", + "content": "There is no direct way to pass arguments to a notebook as a dictionary or list.\nYou can work around this limitation by serializing your list as a JSON file and then passing it as one argument.\nAfter passing the JSON file to the notebook, you can parse it with\njson.loads()\n.\nInstructions\nDefine the argument list and convert it to a JSON file.\nStart by defining your argument list.\nargs = {'arg_1': 'a', 'arg_2': 'b', 'arg_3': 'c', 'arg_4': 'd', 'arg_5': 'e'}\nDeclare the arguments as a dictionary.\nUse\njson.dumps()\nto convert the dictionary to JSON.\nWrite the converted JSON to a file.\n%python\r\n\r\nimport json\r\n\r\nargs = {\"arg_1\": \"a\", \"arg_2\": \"b\", \"arg_3\": \"c\", \"arg_4\": \"d\", \"arg_5\": \"e\"}\r\n\r\nfile = open(\"arguments.txt\", \"w\")\r\nfile.write(json.dumps(args))  \r\nfile.close() \r\n\r\nprint(\"Argument file is created.\")\nOnce the argument file is created, you can open it inside a notebook and use the arguments.\nOpen the JSON file that contains the arguments.\nRead the contents.\nUse\njson.loads()\nto convert the JSON back into a dictionary.\nNow the arguments are available for use inside the notebook.\n%python\r\n\r\nimport json\r\n\r\nfile = open(\"arguments.txt\", \"r\")\r\ndata = file.read()\r\nfile.close()\r\n\r\nargs = json.loads(data)\r\nprint(args)\nYou can use this example code as a base for your own argument lists." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/paths-behave-differently-on-git-folders-and-workspace-folders.json b/scraped_kb_articles/paths-behave-differently-on-git-folders-and-workspace-folders.json new file mode 100644 index 0000000000000000000000000000000000000000..fe728b4107802b92f7f81673f4e08098047e5729 --- /dev/null +++ b/scraped_kb_articles/paths-behave-differently-on-git-folders-and-workspace-folders.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/paths-behave-differently-on-git-folders-and-workspace-folders", + "title": "Unknown Article Title", + "content": "Problem\nWhen you run the same code on a\nGit folder\n(\nAWS\n|\nAzure\n|\nGCP\n) vs. a\nworkspace folder\n(\nAWS\n|\nAzure\n|\nGCP\n), the code fails with a\nModuleNotFoundError: No module named 'some_folder'\nerror message.\nThis can cause confusion, especially if you are debugging code from a Git project outside of the Git project.\nCause\nThe error occurs because imports for Git folders can reference the project’s root. 
When using workspace folders, the root references the current working directory.\nExample\n| some_git_folder (GIT)\r\n| -- some_module\r\n| ---- some_component.py\r\n| -- some_other_folder\r\n| ---- notebook.ipynb\r\n| -- some_file.json\r\n...\r\n| some_standard_folder\r\n| -- some_module\r\n| ---- some_component.py\r\n| -- some_other_folder\r\n| ---- notebook.ipynb\r\n| -- some_file.json\nIn this example, we have a module present under\nsome_folder\n, a notebook under\nsome_other_folder\nand a file present at the root.\nWe mirrored the content across the folders.\nThe file contents are as follows:\nsome_component.py:\n%python\r\n\r\nfrom dataclasses import dataclass\r\n\r\n@dataclass\r\nclass MyModuleClass:\r\n    my_string: str\r\n    my_int: int\r\n\r\n    def __repr__(self):\r\n        return f\"{self.my_string} - {self.my_int}\"\nnotebook.ipynb:\n%python\r\n\r\n# cell [1]\r\nfrom some_module.some_component import MyModuleClass\r\n\r\ndata = MyModuleClass(\"John\", 52)\r\nprint(data)\r\n\r\n# cell [2]\r\nimport json\r\n\r\nwith open('some_file.json') as f:\r\n    data_json = json.loads(f.read())\r\n\r\nprint(data_json)\nIn this example, the notebook runs successfully under the Git folder.\nOutput for\nnotebook.ipynb\nfor a Git folder.\ncell [1] Output\r\nJohn - 52\r\n\r\ncell [2] Output\r\n{'some_key': 'some_value'}\nWhen the code is run under the workspace folder, the output fails for the first cell and succeeds for the second cell.\nOutput for\nnotebook.ipynb\nfor a standard folder.\ncell [1] Output\r\nModuleNotFoundError: No module named 'some_folder'\r\n\r\ncell [2] Output\r\n{'some_key': 'some_value'}\nThis happens because the root folder is different across both locations, even though the notebook code is the same.\nSolution\nYou can resolve this issue in two ways.\nMove the underlying module files so they are located in the same directory as the code that references them.\nMove your notebook to the root directory and refactor your code accordingly." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/pattern-match-files-in-path.json b/scraped_kb_articles/pattern-match-files-in-path.json new file mode 100644 index 0000000000000000000000000000000000000000..a2c6af16a6c6e05e4dc5a09566ad04df9a6a462b --- /dev/null +++ b/scraped_kb_articles/pattern-match-files-in-path.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/pattern-match-files-in-path", + "title": "Título do Artigo Desconhecido", + "content": "When selecting files, a common requirement is to only read specific files from a folder.\nFor example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression.\nThis article uses example patterns to show you how to read specific files from a sample list.\nSample files\nAssume that the following files are located in the root folder.\n//root/1999.txt\r\n//root/2000.txt\r\n//root/2001.txt\r\n//root/2002.txt\r\n//root/2003.txt\r\n//root/2004.txt\r\n//root/2005.txt\r\n//root/2020/04.txt\r\n//root/2020/05.txt\nGlob patterns\nAsterisk\n*\n- The asterisk matches one or more characters. It is a wild card for multiple characters.\nThis example matches all files with a\n.txt\nextension\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/*.txt\"))\nQuestion mark\n?\n- The question mark matches a single character. It is a wild card that is limited to replacing a single character.\nThis example matches all files from the root folder, except 1999.txt. It does not search the contents of the 2020 folder.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/200?.txt\"))\nCharacter class\n[ab]\n- The character class matches a single character from the set. 
It is represented by the characters you want to match inside a set of brackets.\nThis example matches all files with a 2 or 3 in place of the matched character. It returns\n2002.txt\nand\n2003.txt\nfrom the sample files.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/200[23].txt\"))\nNegated character class\n[^ab]\n- The negated character class matches a single character that is not in the set. It is represented by the characters you want to exclude inside a set of brackets.\nThis example matches all files except those with a 2 or 3 in place of the matched character. It returns\n2000.txt\n,\n2001.txt\n,\n2004.txt\n, and\n2005.txt\nfrom the sample files.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/200[^23].txt\"))\nCharacter range\n[a-b]\n- The character range matches a single character in the range of values. It is represented by the range of characters you want to match inside a set of brackets.\nThis example matches all files with a character within the search range in place of the matched character. It returns\n2002.txt\n,\n2003.txt\n,\n2004.txt\n, and\n2005.txt\nfrom the sample files.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/200[2-5].txt\"))\nNegated character range\n[^a-b]\n- The negated character range matches a single character that is not in the range of values. It is represented by the range of characters you want to exclude inside a set of brackets.\nThis example matches all files with a character outside the search range in place of the matched character. It returns\n2000.txt\nand\n2001.txt\nfrom the sample files.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/200[^2-5].txt\"))\nAlternation\n{a,b}\n- Alternation matches either expression. It is represented by the expressions you want to match inside a set of curly brackets.\nThis example matches all files with an expression that matches one of the two selected expressions. 
It returns\n2004.txt\nand\n2005.txt\nfrom the sample files.\n%scala\r\n\r\ndisplay(spark.read.format(\"text\").load(\"//root/20{04,05}.txt\"))" +} \ No newline at end of file diff --git a/scraped_kb_articles/performance-degradation.json b/scraped_kb_articles/performance-degradation.json new file mode 100644 index 0000000000000000000000000000000000000000..3137f949aa0cde4c5f2f7d6f887c32f5c9d7a2d6 --- /dev/null +++ b/scraped_kb_articles/performance-degradation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/performance-degradation", + "title": "Unknown Article Title", + "content": "Problem\nYou have a streaming job whose performance degrades over time.\nYou start a new streaming job with the same configuration and same source, and it performs better than the existing job.\nCause\nIssues with old checkpoints can result in performance degradation in long running streaming jobs.\nThis can happen if the job was intermittently halted and restarted from the same checkpoint.\nYou can validate the issue by reviewing the latest micro batch offset sequence number.\nSolution\nChange the checkpoint directory.\nAvoid restarting old streaming jobs with the same checkpoint directories.\nIf you cannot change the checkpoint directory, increase the cluster capacity." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/performing-count-on-delta-table-using-dedicated-vs-standard-compute.json b/scraped_kb_articles/performing-count-on-delta-table-using-dedicated-vs-standard-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..c011c752309e06f5db50184d557eecba6232f00d --- /dev/null +++ b/scraped_kb_articles/performing-count-on-delta-table-using-dedicated-vs-standard-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/performing-count-on-delta-table-using-dedicated-vs-standard-compute", + "title": "Unknown Article Title", + "content": "Problem\nWhen you use a standard (formerly shared) compute to perform a count on a Delta table version, the count fails with the following error.\nError while reading file :///////.  [DELTA_FILE_NOT_FOUND_DETAILED] File\nCause\nA standard compute reads data from the Parquet files. When data files are removed from a table version, it cannot see the required data files to perform the count.\nBy contrast, a dedicated (formerly single-user) compute reads data from the transaction log JSON files inside the\n_delta_log\ndirectory, which records every change made to a Delta table. It has access to the changes required to perform the count on the table version.\nSolution\nYou can safely switch to a dedicated compute and rerun the count operation on the Delta table version with removed files." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/permission-denied-error-when-trying-to-run-vacuum-command-on-unity-catalog-table-with-dedicated-compute-formerly-single-user-cluster.json b/scraped_kb_articles/permission-denied-error-when-trying-to-run-vacuum-command-on-unity-catalog-table-with-dedicated-compute-formerly-single-user-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..867442075e89bb1ca44ef4c95f8f6150a433e313 --- /dev/null +++ b/scraped_kb_articles/permission-denied-error-when-trying-to-run-vacuum-command-on-unity-catalog-table-with-dedicated-compute-formerly-single-user-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/permission-denied-error-when-trying-to-run-vacuum-command-on-unity-catalog-table-with-dedicated-compute-formerly-single-user-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen using dedicated compute (formerly single-user cluster) to run the command\nVACUUM ..\non a table in Unity Catalog, your query fails with the following error.\nERROR SQLDriverLocal: Error in SQL query: VACUUM `catalog-name`.`schema-name`.`table-name`\r\ncom.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: Catalog 'catalog-name' is not accessible in current workspace\nCause\nThe table has been created using a shallow clone of an existing source table in the workspace. You may not have the required permissions on the cloned table, and the cloned table may not be available in your workspace.\nShallow clones reference the same underlying data files as the source table, rather than creating a separate copy. When running\nVACUUM\non the source table, Databricks validates all references, including cloned tables. Dedicated compute requires explicit access to both the source and cloned tables due to stricter security enforcement.\nSolution\nDatabricks recommends using shared compute (formerly shared clusters). 
Shared compute enforces permissions at runtime, avoiding additional access checks on cloned tables.\nIf you continue to use dedicated compute, ensure you have the required read/write permissions and the cloned table is accessible. Use the following commands to grant access.\nGRANT USE_CATALOG `` TO ``;\r\nGRANT USE_SCHEMA ``.`` TO ``;\r\nGRANT SELECT, MODIFY ON TABLE ``.``.`` TO ``;\nIf the catalog resides in a different workspace, ensure that workspace-catalog binding is disabled. For more details, refer to the\nLimit catalog access to specific workspaces\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/permission_denied-error-when-creating-secret-scope.json b/scraped_kb_articles/permission_denied-error-when-creating-secret-scope.json new file mode 100644 index 0000000000000000000000000000000000000000..9275f9ab0620a5aef9aa741c7d1f48aab8a76589 --- /dev/null +++ b/scraped_kb_articles/permission_denied-error-when-creating-secret-scope.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/permission_denied-error-when-creating-secret-scope", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen creating a secret scope in Databricks, you encounter the following error on the specified Azure Key Vault.\n“com.databricks.common.client.DatabricksServiceHttpClientException: PERMISSION_DENIED:Invalid permissions on the specified KeyVault https:/xxxxxxxxxx.vault.azure.net/.Wrapped Message: Status code 403, {\"error\":{\"code\":\"Forbidden\",\"message\":\"Client address is not authorized and caller was ignored because bypass is set to None\\r\\nClient address: \\r\\nCaller: name=AzureDatabricks;appid=;oid=\\r\\nVault: ;location=japaneast\",\"innererror\":{\"code\":\"ForbiddenByFirewall\"}}}”\nCause\nThe Databricks control plane uses the control plane network address translation (NAT) IP for communicating with external resources like Azure Key Vault. 
Azure Key Vault's firewall settings, however, do not allow the control plane NAT IP, resulting in a\n403 Forbidden\nerror.\nAdditionally, accessing Azure Key Vault uses the control plane NAT IP even when secure cluster connectivity (SCC) is enabled. For more information, refer to the\nEnable secure cluster connectivity\ndocumentation.\nSolution\nAdjust the Azure Key Vault's firewall settings to either allow the Databricks control plane NAT IP or configure the Azure Key Vault to allow trusted Microsoft services to bypass the firewall.\nAllow the control plane NAT IP\nIdentify the control plane NAT IP address from the error message.\nGo to your\nAzure Key Vault\nresource in the Azure portal.\nNavigate to\nFirewalls and virtual networks\n.\nUnder\nAllow access from:\n, select\n\"Allow public access from specific virtual networks and IP addresses\"\n.\nAdd the control plane NAT IP address to the list of allowed IP addresses.\nAllow trusted Microsoft services to bypass the firewall\nNavigate to the\nFirewalls and virtual networks\nsettings of your Azure Key Vault.\nUnder\nException\n, check the box for\n\"Allow trusted Microsoft services to bypass this firewall\"\n. This enables services like Azure Databricks to access the Key Vault even if their IP is not explicitly allowlisted.\nFor more information, refer to the “Configure your Azure Key Vault instance for Azure Databricks” section of the\nSecret Management\ndocumentation and the “Azure Databricks control plane addresses” section of the\nIP addresses and domains for Azure Databricks services and assets\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/permission_denied-error-while-running-automl-experiment-with-group-assigned-cluster.json b/scraped_kb_articles/permission_denied-error-while-running-automl-experiment-with-group-assigned-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..4f7ff9af8720d11bf1d1110a76d90d6beba3b8eb --- /dev/null +++ b/scraped_kb_articles/permission_denied-error-while-running-automl-experiment-with-group-assigned-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/permission_denied-error-while-running-automl-experiment-with-group-assigned-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile trying to create an AutoML experiment with a dedicated access mode cluster assigned to a group, you encounter the following error.\nAutomlServiceError: Failed to create Automl experiment. Status Code: 403: Error: b'{\"error_code\": \"PERMISSION_DENIED\", \"message\": \"unknown does not have View permissions on . Please contact the owner or an administrator for access.\"}’\nCause\nWhen using a group cluster in Databricks, user permissions are scoped to the group. This means all actions on the cluster are performed using the group’s permissions, not the individual user’s.\nAs a result, even if a user personally has access to a folder (like\n/Users/\n), the cluster cannot access it unless the group also has permission. 
This mismatch causes the AutoML experiment to fail with a\nPERMISSION_DENIED\nerror.\nSolution\nCreating a dedicated folder such as\n/Workspace/Groups/\nand giving the group\nCAN MANAGE\npermissions ensures the group has full access to read from and write to that location.\nCreate a\n/Workspace/Groups/\nfolder for the group you plan to use with the group cluster.\nAssign\nCAN MANAGE\npermissions on the folder to the group.\nAdd\nexperiment_dir = “/Workspace/Groups/”\nto your AutoML code.\nMake sure that whatever group you are using has the\n\"Workspace Access\"\nentitlement enabled.\nFor more information on best practices for managing group clusters, review the\nAssign compute resources to a group\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/permission_denied-error-while-trying-to-start-a-sql-warehouse.json b/scraped_kb_articles/permission_denied-error-while-trying-to-start-a-sql-warehouse.json new file mode 100644 index 0000000000000000000000000000000000000000..d49bcffb6e32e3be501572980552d341c7d76117 --- /dev/null +++ b/scraped_kb_articles/permission_denied-error-while-trying-to-start-a-sql-warehouse.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/permission_denied-error-while-trying-to-start-a-sql-warehouse", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to start a SQL warehouse you encounter the following error.\nRequest to create a cluster failed with an exception: PERMISSION_DENIED: Principal is not part of org: .\nCause\nThe SQL warehouse is still owned by a user who is no longer part of the workspace.\nSince Databricks requires a valid and active user to manage warehouse ownership, it fails to initialize the cluster when that user is missing, resulting in the particular\nPERMISSION_DENIED\nerror.\nSolution\nAssign a new owner to the SQL warehouse.\nSelect the affected warehouse.\nNavigate to\nPermissions > Settings > Assign New 
Owner\nChoose a new owner who is either a workspace admin or a user with unrestricted cluster creation permissions.\nSave your changes and restart the warehouse.\nFor more information, refer to the\nCreate a SQL warehouse\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/permissions-error-when-accessing-unity-catalog-tables.json b/scraped_kb_articles/permissions-error-when-accessing-unity-catalog-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..4ff434a4911a9cd22a0a3992388a78fefe9631b1 --- /dev/null +++ b/scraped_kb_articles/permissions-error-when-accessing-unity-catalog-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/permissions-error-when-accessing-unity-catalog-tables", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to access Unity Catalog tables using a single-user cluster on a Databricks Runtime version below 15.4 LTS, you encounter a permission error.\n: java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: User is missing required privileges to access this table from an Assigned cluster. Please try with a SHARED cluster instead.\nCause\nYou’re attempting to access the table without the required privileges, and the single-user cluster does not support the necessary access control. 
Fine-grained access control (FGAC) is not available for Unity Catalog on single-user clusters using Databricks Runtime versions below 15.4 LTS.\nSolution\nThere are three options available to resolve the issue.\nUse a shared cluster or warehouse instead of a single-user cluster.\nUse a serverless compute option.\nKeep the single-user cluster, but upgrade your Databricks Runtime version to 15.4 LTS or above.\nTo change the Databricks Runtime version:\nGo to your Databricks workspace and navigate to\nCompute\n.\nSelect the cluster to update.\nIn the\nDatabricks Runtime Version\nsection, select a version that is 15.4 or above.\nClick\nUpdate\nto apply the changes.\nPreventive measures\nEnsure you’re using a compatible Databricks Runtime version when working with Unity Catalog in single-user access mode.\nVerify you have the necessary privileges to access tables you need." +} \ No newline at end of file diff --git a/scraped_kb_articles/permissions-error-when-trying-to-run-job-clusters-.json b/scraped_kb_articles/permissions-error-when-trying-to-run-job-clusters-.json new file mode 100644 index 0000000000000000000000000000000000000000..ddc274ab15df3a44ef6d63a371c7d60b14bf7a5f --- /dev/null +++ b/scraped_kb_articles/permissions-error-when-trying-to-run-job-clusters-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/permissions-error-when-trying-to-run-job-clusters-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile attempting to run job clusters using a service principal, you receive the error:\nYou cannot set the job's identity to because you do not have the required permissions. 
Please contact your workspace administrator or the user who manages the service principal.\nAdditionally, you may see\nPERMISSION_DENIED: Please contact your administrator\n.\nCause\nThe job clusters are configured to run as a service principal, but the necessary permissions are not correctly set.\nSolution\nEnsure that the service principal has the '\nService Principal User\n' role.\nExplicitly assign yourself the service principal user role, even after creating the service principal. (Manager does not automatically inherit User permissions.)\nIndicate '\ncan use\n' permission on the cluster policy.\nFor additional information, please refer to the\nRoles for managing service principals\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/permissions-issue-when-trying-to-access-dlt-managed-streaming-tables-and-materialized-views-in-unity-catalog.json b/scraped_kb_articles/permissions-issue-when-trying-to-access-dlt-managed-streaming-tables-and-materialized-views-in-unity-catalog.json new file mode 100644 index 0000000000000000000000000000000000000000..f0d8c6753af21924bd22784fb2ed6f54e0c79204 --- /dev/null +++ b/scraped_kb_articles/permissions-issue-when-trying-to-access-dlt-managed-streaming-tables-and-materialized-views-in-unity-catalog.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/permissions-issue-when-trying-to-access-dlt-managed-streaming-tables-and-materialized-views-in-unity-catalog", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to query Delta Live Tables-managed materialized views and streaming tables in Unity Catalog outside of a DLT Pipeline, you encounter a permissions issue.\nCause\nOnly a subset of compute types can fully honor the fine-grained access control supported by Unity Catalog:\nShared\naccess mode (including serverless compute) and\nSingle User\n(also known as\nAssigned\n) access mode with fine-grained access control 
enabled.\nSolution\nGrant Unity Catalog\nUSE_CATALOG\nand\nUSE_SCHEMA\npermissions on the catalog and schema, respectively.\nGrant\nSELECT\npermissions on the materialized view and/or streaming table to the user or service principal identity.\nApply your choice of compute cluster to that identity which can fully honor those permissions.\nFor more information, review the\nUse Unity Catalog with your Delta Live Tables pipelines\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nFine-grained access control on Single User compute\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/persist-metrics-csv-sink-dbfs.json b/scraped_kb_articles/persist-metrics-csv-sink-dbfs.json new file mode 100644 index 0000000000000000000000000000000000000000..812c56ae511ecb02f94a762d7107ad84508e4296 --- /dev/null +++ b/scraped_kb_articles/persist-metrics-csv-sink-dbfs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/persist-metrics-csv-sink-dbfs", + "title": "Título do Artigo Desconhecido", + "content": "Spark has a configurable\nmetrics system\nthat supports a number of sinks, including CSV files.\nIn this article, we are going to show you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location.\nCreate an init script\nAll of the configuration is done in an init script.\nThe init script does the following three things:\nConfigures the cluster to generate CSV metrics on both the driver and the worker.\nWrites the CSV metrics to a temporary, local folder.\nUploads the CSV metrics from the temporary, local folder to the chosen DBFS location.\nDelete\nNote\nThe CSV metrics are saved locally before being uploaded to the DBFS location because DBFS is not designed for a large number of random writes.\nCustomize the sample code and then run it in a notebook to create an init script on your cluster.\nSample code to create an init 
script:\n%python\r\n\r\ndbutils.fs.put(\"//metrics.sh\",\"\"\"\r\n#!/bin/bash\r\nmkdir /tmp/csv\r\nsudo bash -c \"cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties\r\n*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink\r\nspark.metrics.staticSources.enabled true\r\nspark.metrics.executorMetricsSource.enabled true\r\nspark.executor.processTreeMetrics.enabled true\r\nspark.sql.streaming.metricsEnabled true\r\nmaster.source.jvm.class org.apache.spark.metrics.source.JvmSource\r\nworker.source.jvm.class org.apache.spark.metrics.source.JvmSource\r\n*.sink.csv.period 5\r\n*.sink.csv.unit seconds\r\n*.sink.csv.directory /tmp/csv/\r\nworker.sink.csv.period 5\r\nworker.sink.csv.unit seconds\r\nEOF\"\r\n\r\nsudo bash -c \"cat <<EOF >> /databricks/spark/conf/metrics.properties\r\n*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink\r\nspark.metrics.staticSources.enabled true\r\nspark.metrics.executorMetricsSource.enabled true\r\nspark.executor.processTreeMetrics.enabled true\r\nspark.sql.streaming.metricsEnabled true\r\ndriver.source.jvm.class org.apache.spark.metrics.source.JvmSource\r\nexecutor.source.jvm.class org.apache.spark.metrics.source.JvmSource\r\n*.sink.csv.period 5\r\n*.sink.csv.unit seconds\r\n*.sink.csv.directory /tmp/csv/\r\nworker.sink.csv.period 5\r\nworker.sink.csv.unit seconds\r\nEOF\"\r\n\r\ncat <<'EOF' >> /tmp/asynccode.sh\r\n#!/bin/bash\r\nDB_CLUSTER_ID=$(echo $HOSTNAME | awk -F '-' '{print$1\"-\"$2\"-\"$3}')\r\nMYIP=$(hostname -I)\r\nif [[ !
-d /dbfs//${DB_CLUSTER_ID}/metrics-${MYIP} ]] ; then\r\nsudo mkdir -p /dbfs//${DB_CLUSTER_ID}/metrics-${MYIP}\r\nfi\r\nwhile true; do\r\n    if [ -d \"/tmp/csv\" ]; then\r\n        sudo cp -r /tmp/csv/* /dbfs//$DB_CLUSTER_ID/metrics-$MYIP\r\n  fi\r\n  sleep 5\r\ndone\r\nEOF\r\nchmod a+x /tmp/asynccode.sh\r\n/tmp/asynccode.sh & disown\r\n\"\"\", True)\nReplace\n\nwith the DBFS location you want to use to save the init script.\nReplace\n\nwith the DBFS location you want to use to save the CSV metrics.\nCluster-scoped init script\nOnce you have created the init script on your cluster, you must configure it as a\ncluster-scoped init script\n.\nVerify that CSV metrics are correctly written\nRestart your cluster and run a sample job.\nCheck the DBFS location that you configured for CSV metrics and verify that they were correctly written." +} \ No newline at end of file diff --git a/scraped_kb_articles/persist-share-code-rstudio.json b/scraped_kb_articles/persist-share-code-rstudio.json new file mode 100644 index 0000000000000000000000000000000000000000..880dd2c0d1c75a593ffa4acec14cf6487da01bf0 --- /dev/null +++ b/scraped_kb_articles/persist-share-code-rstudio.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/persist-share-code-rstudio", + "title": "Unknown Article Title", + "content": "Problem\nUnlike a Databricks notebook that has version control built in, code developed in RStudio is lost when the high concurrency cluster hosting RStudio is shut down.\nSolution\nTo persist and share code in RStudio, do one of the following:\nFrom RStudio, save the code to a folder on DBFS which is accessible from both Databricks notebooks and RStudio.\nUse the integrated support for version control like Git in RStudio.\nSave the R notebook to your local file system by exporting it as\nRmarkdown\n, then import the file into the RStudio instance.\nThe blog\nSharing R Notebooks using RMarkdown\ndescribes the steps in more detail.\nThis process allows you to
persist code developed in RStudio and share notebooks between the Databricks notebook environment and RStudio." +} \ No newline at end of file diff --git a/scraped_kb_articles/photon-memory-issue-while-querying-a-table-with-a-large-number-of-columns.json b/scraped_kb_articles/photon-memory-issue-while-querying-a-table-with-a-large-number-of-columns.json new file mode 100644 index 0000000000000000000000000000000000000000..551d4174735a0bc1e8e118716549414428756673 --- /dev/null +++ b/scraped_kb_articles/photon-memory-issue-while-querying-a-table-with-a-large-number-of-columns.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/photon-memory-issue-while-querying-a-table-with-a-large-number-of-columns", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile using a\nCREATE TABLE AS SELECT\n(CTAS) SQL statement on a Delta table, you receive the following error.\nPhoton out of memory error query id: \nCause\nThe table has more columns than what Photon’s architecture is designed to optimally handle.\nSolution\nIn Databricks 15.3 and above, you can modify your CTAS SQL statement to create a table with one variant column (of type VARIANT) containing all the data.\nCREATE TABLE AS SELECT PARSE_JSON() FROM \nFor more information, please refer to the\nVARIANT type\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation and the\nQuery variant data\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAlternatively, you can disable Photon. Run the following command in a notebook. 
Be aware that disabling may impact performance.\nset enable_photon = false;" +} \ No newline at end of file diff --git a/scraped_kb_articles/photon-ran-out-of-memory-while-executing-query.json b/scraped_kb_articles/photon-ran-out-of-memory-while-executing-query.json new file mode 100644 index 0000000000000000000000000000000000000000..7209ddfd3238040c9ed4137cc9fa84d39ab17209 --- /dev/null +++ b/scraped_kb_articles/photon-ran-out-of-memory-while-executing-query.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/photon-ran-out-of-memory-while-executing-query", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile working with Databricks Photon clusters, especially when dealing with large datasets and complex queries, you encounter an issue where the cluster fails to complete a request and gives an error message.\nError:\r\nexception - An error occurred while calling o3085.javaToPython.\r\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 1931.0 failed 4 times, most recent failure: Lost task 120.3 in stage 1931.0 (TID 17240) (10.68.146.101 executor 9): org.apache.spark.memory.SparkOutOfMemoryError: Photon ran out of memory while executing this query.\r\nPhoton failed to reserve 2.0 MiB for rows, in partition 0, in PartitionedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=XXX).\r\nMemory usage:\r\nTotal task memory (including non-Photon): 6.3 GiB\r\nBroadcastHashedRelation(spark_plan_id=171317): allocated 6.2 GiB, tracked 6.2 GiB, untracked allocated 0.0 B, peak 6.2 GiB\r\n BuildHashedRelation: allocated 6.2 GiB, tracked 6.2 GiB, untracked allocated 0.0 B, peak 6.2 GiB\r\n  PartitionedRelation: allocated 6.2 GiB, tracked 6.2 GiB, untracked allocated 0.0 B, peak 6.2 GiB\r\n   partition 0: allocated 6.2 GiB, tracked 6.2 GiB, untracked allocated 0.0 B, peak 6.2 GiB\r\n    rows: allocated 5.5 GiB, tracked 5.5 GiB, untracked allocated 0.0 B, peak 5.5 GiB\r\n    var-len 
data: allocated 664.0 MiB, tracked 664.0 MiB, untracked allocated 0.0 B, peak 664.0 MiB\nCause\nYour query has run out of memory during execution, specifically when using the\nBuildHashedRelation\nand\nPartitionedRelation\nfunctions.\nRunning out of memory happens when memory is improperly allocated during query execution. The Photon cluster relies on accurate table statistics to optimize query execution and manage memory usage. When the statistics are incorrect, Photon may allocate insufficient memory for the query, resulting in an\nOut of Memory\nerror.\nAdditionally, memory management issues can occur when:\nQueries have multiple joins, subqueries, or aggregations, which increases the complexity of memory management. This makes it more challenging for Photon to accurately estimate memory needs.\nYou’re working with large datasets, which increases the likelihood of encountering an out-of-memory error. Photon may underestimate the memory required to process the data.\nYou work with dependencies such as outdated libraries or incompatible versions, which also contribute to memory management problems in Photon.\nSolution\nEnsure that all tables involved in the query have up-to-date statistics. Execute\nANALYZE TABLE COMPUTE STATISTICS\non each table to recompute and update the statistics.\nANALYZE TABLE  COMPUTE STATISTICS;\nIf possible, simplify complex queries by breaking them down into smaller, more manageable parts. This can help Photon better estimate memory requirements and reduce the likelihood of an out-of-memory error.\nUpgrade to Databricks Runtime 13.3 LTS or above. There is a new feature added to Databricks Runtime versions starting with 13.3 LTS that helps mitigate this issue." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/pin-cluster-configurations-using-the-api.json b/scraped_kb_articles/pin-cluster-configurations-using-the-api.json new file mode 100644 index 0000000000000000000000000000000000000000..549748bf829c89b7eb894643be7377f91344eae2 --- /dev/null +++ b/scraped_kb_articles/pin-cluster-configurations-using-the-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/pin-cluster-configurations-using-the-api", + "title": "Título do Artigo Desconhecido", + "content": "Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated.\nIf you want to keep specific cluster configurations, you can pin them. Up to 100 clusters can be pinned.\nPinned clusters are not automatically deleted, however they can be manually deleted.\nDelete\nInfo\nYou must be a Databricks administrator to pin a cluster.\nYou can easily pin a cluster (\nAWS\n|\nAzure\n|\nGCP\n) via the workspace UI, but if you are managing your clusters via the API, you can also use the Pin endpoint (\nAWS\n|\nAzure\n|\nGCP\n) in the Clusters API.\nInstructions\nPin all unpinned clusters\nDelete\nInfo\nA maximum of 100 clusters can be pinned. If you attempt to \"pin all\" via the API, the clusters are pinned based on the order in which they are listed in the response. Once a total of 100 clusters are pinned, no more can be pinned until some are removed.\nUse the following sample code to pin all unpinned clusters in your workspace.\nBefore running the sample code, you will need a personal access token (\nAWS\n|\nAzure\n|\nGCP\n) and your workspace domain. 
The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nRun the cell to pin all unpinned clusters in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders = {\r\n 'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\nfor unpinned in cluster[\"clusters\"]:\r\n if not 'pinned_by_user_name' in unpinned :\r\n print(\"Pinning\"+\" , \"+ unpinned[\"default_tags\"]['ClusterName'])\r\n url = workspace_url + \"/api/2.0/clusters/pin\"\r\n requests.post(url,json={\"cluster_id\" : unpinned[\"cluster_id\"]},headers=headers)\nPin a cluster by name\nUse the following sample code to pin a specific cluster in your workspace.\nBefore running the sample code, you will need a personal access token and your workspace domain. The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nUpdate the\n\nvalue with the name of the cluster you want to pin.\nRun the cell to pin the selected cluster in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders = {\r\n 'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\n\r\nfor unpinned in cluster[\"clusters\"]:\r\n if not 'pinned_by_user_name' in unpinned :\r\n if unpinned[\"default_tags\"]['ClusterName'] == \"\" :\r\n     print(\"Pinning\"+\" , \"+ unpinned[\"default_tags\"]['ClusterName'])\r\n     url = workspace_url + \"/api/2.0/clusters/pin\"\r\n     requests.post(url,json={\"cluster_id\" : unpinned[\"cluster_id\"]},headers=headers)\nPin all clusters by a specific user\nUse the following sample code to pin a specific cluster in your 
workspace.\nBefore running the sample code, you will need a personal access token and your workspace domain. The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nUpdate the\n\nvalue with the name of the user whose clusters you want to pin.\nRun the cell to pin the selected clusters in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders={\r\n 'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\n\r\nfor unpinned in cluster[\"clusters\"]:\r\n     if not 'pinned_by_user_name' in unpinned :\r\n        if unpinned[\"creator_user_name\"] == \"\" :\r\n            url = workspace_url + \"/api/2.0/clusters/pin\"\r\n            requests.post(url,json={\"cluster_id\" : unpinned[\"cluster_id\"]},headers=headers)\r\n            print(\"Pinning\"+\" , \"+ unpinned[\"default_tags\"]['ClusterName'])" +} \ No newline at end of file diff --git a/scraped_kb_articles/pin-r-packages.json b/scraped_kb_articles/pin-r-packages.json new file mode 100644 index 0000000000000000000000000000000000000000..838b6d00870365b8a984fa1a01f18c90d0053c57 --- /dev/null +++ b/scraped_kb_articles/pin-r-packages.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/pin-r-packages", + "title": "Unknown Article Title", + "content": "When you use the\ninstall.packages()\nfunction to install CRAN packages, you cannot specify the version of the package, because the expectation is that you will install the latest version of the package and it should be compatible with the latest version of its dependencies. If you have an outdated dependency installed, it will be updated as well.\nSometimes you want to fix the version of an R package.
There are several ways to do this:\nUse the\ndevtools package\n.\nDownload and install a package file from a\nCRAN archive\n.\nUse a CRAN snapshot.\nWhen you use the Libraries UI or API (\nAWS\n|\nAzure\n|\nGCP\n) to install R packages on all the instances of a cluster, we recommend the third option.\nThe Microsoft R Application Network maintains a\nCRAN Time Machine\nthat stores a snapshot of CRAN every night. The snapshots are available at\nhttps://cran.microsoft.com/snapshot/\nwhere\n\nis the date of the desired snapshot, for example, 2019-05-01. To install specific versions of R packages, specify this URL as the repository of your CRAN library (\nAWS\n|\nAzure\n|\nGCP\n)when you create the library." +} \ No newline at end of file diff --git a/scraped_kb_articles/powerbi-proxy-ssl-configuration.json b/scraped_kb_articles/powerbi-proxy-ssl-configuration.json new file mode 100644 index 0000000000000000000000000000000000000000..fece39fbec5ff9f915fb1d60eb7c1afcea1e0335 --- /dev/null +++ b/scraped_kb_articles/powerbi-proxy-ssl-configuration.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/powerbi-proxy-ssl-configuration", + "title": "Título do Artigo Desconhecido", + "content": "Driver configurations\nYou can set driver configurations using the\nmicrosoft.sparkodbc.ini\nfile which can be found in the\nODBC Drivers\\Simba Spark ODBC Driver\ndirectory. The absolute path of the\nmicrosoft.sparkodbc.ini\ndirectory depends on whether you are using Power BI Desktop or on-premises Power BI Gateway:\nPower BI Desktop:\nC:\\Program Files\\Microsoft Power BI Desktop\\bin\\ODBC Drivers\\Simba Spark ODBC Driver\\microsoft.sparkodbc.ini\nPower BI Gateway:\nm\\ODBC Drivers\\Simba Spark ODBC Driver\\microsoft.sparkodbc.ini\n, where\nm\nis placed inside the gateway installation directory.\nSet driver configurations\nCheck if the\nmicrosoft.sparkodbc.ini\nfile was already created. 
If it is then jump to step 3.\nOpen\nNotepad\nor\nFile Explorer\nas\nRun As Administrator\nand create a file at\nODBC Drivers/Simba Spark ODBC Driver/microsoft.sparkodbc.ini\n.\nAdd the new driver configurations to the file below the header\n[Driver]\nby using the syntax =. Configuration keys can be found in the manual provided with the installation of the\nDatabricks ODBC Driver\n. The manual is located at\nC:\\Program Files\\Simba Spark ODBC Driver\\Simba Apache Spark ODBC Connector Install and Configuration Guide.html\n.\nConfiguring a proxy\nTo configure a proxy, add the following configurations to the driver configuration in the\nmicrosoft.sparkodbc.ini\nfile:\n[Driver]\r\nUseProxy=1\r\nProxyHost=\r\nProxyPort=\r\nProxyUID=\r\nProxyPWD=\nDepending on the firewall configuration it might also be necessary to add:\n[Driver]\r\nCheckCertRevocation=0\nTroubleshooting\nError: SSL_connect: certificate verify failed\nWhen SSL issues occur, the ODBC driver returns a generic error\nSSL_connect: certificate verify failed\n. You can get more detailed SSL debugging logs by setting in the\nODBC Drivers/Simba Spark ODBC Driver/microsoft.sparkodbc.ini\nfile the following two configurations:\n[Driver]\r\nAllowDetailedSSLErrorMessages=1\r\nEnableCurlDebugLogging=1\nDiagnose issues by analyzing CryptoAPI logs\nMost issues can be diagnosed by using Windows CryptoAPI logs, which can be found in the Event Viewer. 
The following steps describe how to capture these logs.\nOpen\nEvent Viewer\nand go to\nApplications and Services Logs\n>\nMicrosoft\n>\nWindows\n>\nCAPI2\n>\nOperational\n.\nIn\nFilter Current Log\n, check the boxes\nCritical\n,\nError\n, and\nWarning\nand click\nOK\n.\nIn the\nEvent Viewer\n, go to\nActions\n>\nEnable Log\nto start collecting logs.\nConnect\nPower BI\nto Azure Databricks to reproduce the issue.\nIn the Event Viewer, go to\nActions\n>\nDisable Log\nto stop collecting logs.\nClick\nRefresh\nto retrieve the list of collected events.\nExport logs by clicking\nActions\n>\nSave Filtered Log File As\n.\nDiagnose Build Chain or Verify Chain Policy event errors\nIf the collected logs contain an error on the\nBuild Chain\nor\nVerify Chain Policy\nevents, this likely points to the issue. More details can be found by selecting the event and reading the\nDetails\nsection. Two fields of interest are\nResult\nand\nRevocationResult\n.\nThe revocation status of the certificate or one of the certificates in the certificate chain is unknown.\nCAPI2 error\n:\nRevocationResult: [80092013] The revocation function was unable to check revocation because the revocation server was offline.\nCause\n: The revocation check failed due to an unavailable certificate revocation server.\nResolution\n: Disable certificate revocation checking.\nThe certificate chain is not complete.\nCAPI2 error\n:\nResult: [800B010A] A certificate chain could not be built to a trusted root authority.\nCause\n: The certificate advertised by the VPN or proxy server is incomplete and does not contain a full chain to the trusted root authority.\nResolution\n: The preferred solution is to configure the VPN or proxy server to advertise the full chain. If this is not possible, a workaround is to obtain the intermediate certificates for the Databricks workspace, and install these in the Intermediate Certification Authorities store, to enable Windows to find the unadvertised certificates. 
If possible, it is recommended to install these certificates for all Power BI users using a group policy in Windows. This has to be set up by the system administrator.\nCertificate configurations\nDisable certificate revocation checking\nIf the ODBC driver is unable to reach the certificate revocation list server, for example because of a firewall configuration, it will fail to validate the certificate. This can be resolved by disabling this check. To disable certificate revocation checking, set the configuration\nCheckCertRevocation=0\nto the\nmicrosoft.sparkodbc.ini\nfile.\nInstall intermediate certificates\nOpen your Azure Databricks workspace URL in Chrome and go to\nView site information\nby clicking the padlock icon in the address bar.\nClick\nCertificate\n>\nCertificate Path\nand repeat steps 3 to 6 for every intermediate certificate in the chain.\nChoose an intermediate certificate and go to\nDetails\n>\nCopy to File\n>\nNext\nto export the certificate.\nSelect the location of the certificate and click\nFinish\n.\nOpen the exported certificate and click\nInstall Certificate\n>\nNext\n.\nFrom the\nCertificate Import Wizard\nclick\nPlace all certificates in the following store\n>\nBrowse and choose Intermediate Certification Authorities\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/previously-working-jobs-now-failing-to-execute-with-metastore_does_not_exist-error.json b/scraped_kb_articles/previously-working-jobs-now-failing-to-execute-with-metastore_does_not_exist-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1e138477e73d2b599183efc0aba04755b8af4b98 --- /dev/null +++ b/scraped_kb_articles/previously-working-jobs-now-failing-to-execute-with-metastore_does_not_exist-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/previously-working-jobs-now-failing-to-execute-with-metastore_does_not_exist-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn non-Unity Catalog (UC) workspaces, or workspaces with UC disabled, you find that previously working jobs fail to execute with the following error.\nMETASTORE_DOES_NOT_EXIST: No metastore assigned for the current workspace.\r\nOr \r\ncom.databricks.sql.managedcatalog.UnityCatalogServiceException: [RequestId=*****-****-****-****-******** ErrorClass=FEATURE_DISABLED] Unity Catalog is not available for feature tier STANDARD_TIER.\nCause\nThe error occurs for jobs created using the Databricks API without the\ndata_security_mode\nfield in the cluster properties, which are later updated through manual changes to the job cluster in the Databricks UI.\nThe\ndata_security_mode\nfield is crucial. When this field is missing, the job cluster will default to\nNone\n. When a job is later updated in the UI, it is set to\ndata_security_mode: SINGLE_USER\n.\nWhen a cluster has\ndata_security_mode: SINGLE_USER\n, it becomes UC enabled, triggering the issue.\nSolution\nOnly use the Databricks update job API to update the job cluster, and set the\ndata_security_mode: LEGACY_SINGLE_USER_STANDARD\n.\nFor legacy passthrough, set\ndata_security_mode: LEGACY_SINGLE_USER\n.\nReview the\nUpdate jobs settings partially\nAPI documentation for more details regarding data_security_mode."
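To illustrate the fix, here is a sketch of a partial-update payload for the Jobs API that pins `data_security_mode` on a job cluster. The job ID, cluster key, and the exact field nesting shown are illustrative assumptions, not confirmed by the article.

```python
import json

def build_job_update_payload(job_id: int, job_cluster_key: str) -> dict:
    """Sketch of a partial jobs/update payload pinning the cluster's data_security_mode."""
    return {
        "job_id": job_id,
        "new_settings": {
            "job_clusters": [
                {
                    "job_cluster_key": job_cluster_key,
                    "new_cluster": {
                        # Pin the mode so a later UI edit cannot silently
                        # flip the cluster to a UC-enabled SINGLE_USER mode.
                        "data_security_mode": "LEGACY_SINGLE_USER_STANDARD",
                    },
                }
            ]
        },
    }

payload = build_job_update_payload(123, "main_cluster")
print(json.dumps(payload, indent=2))
```

For legacy passthrough clusters, the same sketch would carry `LEGACY_SINGLE_USER` instead.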
+} \ No newline at end of file diff --git a/scraped_kb_articles/programmatically-determine-if-a-table-is-a-delta-table-or-not.json b/scraped_kb_articles/programmatically-determine-if-a-table-is-a-delta-table-or-not.json new file mode 100644 index 0000000000000000000000000000000000000000..58d668db5217d8f5aca4ae46cbb3e462a73568ac --- /dev/null +++ b/scraped_kb_articles/programmatically-determine-if-a-table-is-a-delta-table-or-not.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/programmatically-determine-if-a-table-is-a-delta-table-or-not", + "title": "Título do Artigo Desconhecido", + "content": "You may not always know the type of table you need to read. For example, if a given table is a Delta table you may need to read it differently than if it were a Parquet table.\nThis article explains how you can use Python code in a Databricks notebook to programmatically determine if a table is a Delta table or not.\nInstructions\nAttach your notebook to an all-purpose cluster.\nCopy the example code to your notebook.\nReplace the following values in the example code:\n\n- The name of the table you want to read\nRun the cell.\nIf the table is a Delta table, the example code returns\nYes, it is a Delta table\n.\nIf the table is not a Delta table, the example code returns\nNo, it is not a Delta table.\nYou can use this example code as a basis to build an automatic check into your notebook code.\nExample code\n%python\r\n\r\ndef delta_check(TableName: str) -> bool:\r\n desc_table = spark.sql(f\"describe formatted {TableName}\").collect()\r\n location = [i[1] for i in desc_table if i[0] == 'Location'][0]\r\n try:\r\n dir_check = dbutils.fs.ls(f\"{location}/_delta_log\")\r\n is_delta = True\r\n except Exception as e:\r\n is_delta = False\r\n return is_delta\r\n\r\nres = delta_check(\"\")\r\n\r\nif res:\r\n print(\"Yes, it is a Delta table.\")\r\nelse:\r\n print(\"No, it is not a Delta table.\")" +} \ No newline at end of file diff --git
a/scraped_kb_articles/pulumi-fails-to-deploy-workflows-in-serverless-mode.json b/scraped_kb_articles/pulumi-fails-to-deploy-workflows-in-serverless-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..f855b379bdbb93288f48f7b9f5d1986281a615cd --- /dev/null +++ b/scraped_kb_articles/pulumi-fails-to-deploy-workflows-in-serverless-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/pulumi-fails-to-deploy-workflows-in-serverless-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are deploying a workflow via the\nPulumi IaC tool\nwhen you encounter cluster issues.\nThe workflow deployment is not allowed without one of the\njob_cluster_key\n,\nnew_cluster\n, or\nexisting_cluster_id\nto set it to run in serverless mode.\nUsers are unable to manually update a job to use serverless. It presents a similar error.\nCause\nThe primary issue is a configuration problem with the Pulumi provider, which requires one of the parameters (\njob_cluster_key\n,\nnew_cluster\n, or\nexisting_cluster_id\n) to trigger a workflow. 
Pulumi can trigger workflows (serverless or multi-task jobs) with a REST activity.\nThe issue may be caused by an outdated version of the Pulumi provider or a configuration error in the Pulumi payload.\nSolution\nUpdate your Pulumi provider to the latest version and check the Pulumi payload for any configuration errors to resolve the primary issue.\nMake sure you remove the job cluster definition from your payload.\nSpecify your payload as shown in the following example.\nYou need to specify the following values before using the example payload:\n\n- The resource prefix from the Pulumi configuration.\n\n- The email address that you want to receive notifications.\n\n- The home directory for your user in your Databricks workspace.\njob = Job(\r\n resource_name = f\"-job\",\r\n name = f\"-job\",\r\n tasks = [\r\n  JobTaskArgs(\r\n   task_key = f\"-task\",\r\n   notebook_task = JobNotebookTaskArgs(\r\n    notebook_path = f\"/Pulumi/-notebook.py\"\r\n   )\r\n  )\r\n ],\r\n email_notifications = JobEmailNotificationsArgs(\r\n  on_successes = [ ],\r\n  on_failures = [ ]\r\n )\r\n)\nIf you are unable to manually update the job to use serverless, try the following steps:\nOpen your workspace and click\nWorkflows\n.\nClick\nJobs & pipelines\n.\nClick the job that needs to be updated.\nClick\nEdit\n.\nRemove the\njobClusters\nproperty from the job definition.\nSave the changes and deploy the workflow."
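The manual steps above boil down to stripping every cluster binding from the job settings. A minimal Python sketch of that transformation follows; the field names mirror the Jobs API naming and the sample settings dict is hypothetical.

```python
import copy

def to_serverless(job_settings: dict) -> dict:
    """Return a deep copy of the job settings with every cluster binding
    removed, so the workflow falls back to serverless compute."""
    cleaned = copy.deepcopy(job_settings)
    cleaned.pop("job_clusters", None)          # shared job cluster definitions
    for task in cleaned.get("tasks", []):
        task.pop("job_cluster_key", None)      # reference to a job cluster
        task.pop("new_cluster", None)          # inline cluster spec
        task.pop("existing_cluster_id", None)  # all-purpose cluster reference
    return cleaned

# Hypothetical job settings mirroring the article's payload shape.
settings = {
    "name": "demo-job",
    "job_clusters": [{"job_cluster_key": "jc", "new_cluster": {"num_workers": 1}}],
    "tasks": [{"task_key": "demo-task", "job_cluster_key": "jc",
               "notebook_task": {"notebook_path": "/Pulumi/demo-notebook.py"}}],
}
serverless = to_serverless(settings)
print(serverless)
```

The deep copy keeps the original settings intact, which is useful when diffing before and after the change.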
+} \ No newline at end of file diff --git a/scraped_kb_articles/py4jjavaerror-when-trying-to-install-libraries-on-ssl-encrypted-cluster.json b/scraped_kb_articles/py4jjavaerror-when-trying-to-install-libraries-on-ssl-encrypted-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..0272f9dd12f1b92887a9efb76d96a4a83a117212 --- /dev/null +++ b/scraped_kb_articles/py4jjavaerror-when-trying-to-install-libraries-on-ssl-encrypted-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/py4jjavaerror-when-trying-to-install-libraries-on-ssl-encrypted-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to install libraries using\n%pip\nor\n%conda\ncommands on a cluster with SSL encryption enabled, you receive an error.\nPy4JJavaError: An error occurred while calling t.getNotebookScopedPythonEnvManager. : org.apache.spark.SparkException: %pip/%conda commands use unencrypted NFS and are disabled by default when SSL encryption is enabled. NFS can be safely used to install libraries that do not contain PHI or other sensitive data, such as open source packages. %pip/%conda commands or NFS should not be used to transmit PHI to Spark workers. To enable %pip/%conda commands, set spark.databricks.libraries.ignoreSSL to true in Spark config in cluster settings and restart your cluster.\nCause\nWhen SSL encryption is enabled on a cluster,\n%pip\nand\n%conda\ncommands are disabled by default because they use unencrypted NFS. 
This is a security measure to prevent the transmission of sensitive data over unencrypted channels.\nSolution\nIn your cluster, click\nAdvanced Options\n.\nNavigate to the\nSpark\ntab.\nIn the\nSpark config\nbox, enter\nspark.databricks.libraries.ignoreSSL true\n.\nRestart your cluster to apply the new configuration.\nImportant\nYou can safely set the\nspark.databricks.libraries.ignoreSSL\nconfiguration to true when installing open source packages, as long as the packages don’t contain protected health information (PHI) or other sensitive data.\nIf you have further security concerns, consult with your internal security team for guidance.\nFor more information, refer to the\nEncrypt traffic between cluster worker nodes\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/pyarrow-hotfix-breaking-change.json b/scraped_kb_articles/pyarrow-hotfix-breaking-change.json new file mode 100644 index 0000000000000000000000000000000000000000..88bd9339b543a15f3341b9a1230da48c5682dff0 --- /dev/null +++ b/scraped_kb_articles/pyarrow-hotfix-breaking-change.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/pyarrow-hotfix-breaking-change", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nOn Dec 5th, 2023, Databricks rolled out a security update to all supported Databricks Runtime versions to address a critical vulnerability (\nCVE-2023-47248\n) from the PyArrow python package embedded in Databricks Runtime.\nWith this update, Databricks packages and automatically activates\npyarrow-hotfix\nas remediation. The PyArrow community recommends this solution. This change turns off the vulnerable feature from PyArrow that is not used by Databricks Runtime by default. 
However, this change could break some workloads customized to use the vulnerable feature from PyArrow.\nThis security update is only applied to PyArrow embedded in Databricks Runtime versions, but not in cases where you installed your version of PyArrow on a Databricks cluster.\nImpact\nThere is no impact if your workloads don’t use the PyArrow extension datatype\npyarrow.PyExtensionType\n.\nThere is no impact if your workloads use the secure PyArrow extension datatype,\npyarrow.ExtensionType\n.\nIf your workloads use the vulnerable PyArrow extension datatype\npyarrow.PyExtensionType\n, then they will fail with the following error message:\nFound disallowed extension datatype (arrow.py_extension_type), please check \r\nhttps://kb.databricks.com/pyarrow-hotfix-breaking-change for helps. \r\n\r\nOriginal error message from pyarrow-hotfix: \r\nDisallowed deserialization of 'arrow.py_extension_type': \r\n......\nSolution\nYou can fix the workload by completing one of the following actions.\nDatabricks recommends performing a code change to remediate the vulnerability.\nOption 1: Code change (recommended)\nIf you use\npyarrow.PyExtensionType\nto process Parquet files, change your Parquet files and data processing code to use the secure API,\npyarrow.ExtensionType\ninstead of\npyarrow.PyExtensionType\n. This long-term solution requires changes to the code or process that produces the Parquet files.\nOption 2: Turn off the security update\nIf you cannot perform the code changes and you trust the provider of the Parquet files, you can temporarily turn off the security update at your own risk. 
This solution is a temporary remediation, available to Databricks Runtime versions 14.2 and below.\nTurn off the automatic activation of the security update by setting the following\nenvironment variable\n(\nAWS\n|\nAzure\n|\nGCP\n) in the configurations of all affected clusters.\nDATABRICKS_DISABLE_AUTO_PYARROW_HOTFIX=True\nTurning off the automatic activation re-enables PyArrow’s ability to work with parquet files that use the vulnerable API,\npyarrow.PyExtensionType\n.\nHelp me choose\nDo you use an extension datatype in PyArrow in your workload?\n- No: this issue does not impact you.\n- Yes: do you use pyarrow.PyExtensionType (insecure)?\n- No: this issue does not impact you.\n- Yes: are you comfortable migrating to pyarrow.ExtensionType (secure)?\n- Yes: follow the steps in Option 1: Code change.\n- No, and you want to disable the security update: do you trust the parquet file provider?\n- Yes: follow the steps in Option 2: Turn off the security update.\n- No: your system is in an UNSAFE state. This is not recommended."
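The "Help me choose" decision tree can be sketched as a small Python function, which may be handy when auditing many workloads at once. The function name and return strings are illustrative, not part of the article.

```python
def pyarrow_hotfix_guidance(uses_extension_type: bool,
                            uses_py_extension_type: bool,
                            can_migrate: bool,
                            trusts_file_provider: bool) -> str:
    """Encode the article's 'Help me choose' decision tree."""
    # No extension datatypes, or only the secure pyarrow.ExtensionType: no impact.
    if not uses_extension_type or not uses_py_extension_type:
        return "This issue does not impact you."
    # Insecure pyarrow.PyExtensionType in use: prefer the code change.
    if can_migrate:
        return "Option 1: Code change (recommended)"
    # Otherwise, only disable the hotfix if the file source is trusted.
    if trusts_file_provider:
        return "Option 2: Turn off the security update"
    return "UNSAFE: disabling the update without trusting the provider is not recommended"

print(pyarrow_hotfix_guidance(True, True, True, False))
```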
+} \ No newline at end of file diff --git a/scraped_kb_articles/pyjerror-when-using-the-tojson-method-in-standard-access-mode-compute.json b/scraped_kb_articles/pyjerror-when-using-the-tojson-method-in-standard-access-mode-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..d0b6f682d6517247e6d3b2c6355d6c446c06434a --- /dev/null +++ b/scraped_kb_articles/pyjerror-when-using-the-tojson-method-in-standard-access-mode-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/pyjerror-when-using-the-tojson-method-in-standard-access-mode-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using the\ntoJson()\nmethod with\ndbutils\nin a Python notebook on a standard (formerly shared) access mode compute cluster.\ndbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()\nUpon executing, you receive a\nPyJError\n.\nPyJError: An error occurred while calling 0450.toJson.\r\nTrace: py4j.security.Py4JSecurityException: Method public java.lang.String com.databricks.backend.common.rpc.CommandContext.toJson() is not allowlisted on class class com.databricks.backend.common.rpc.CommandContext\nCause\nThe\ntoJson()\nmethod is not allowlisted on standard access mode clusters.\nSolution\nChange the method from\ntoJson()\nto\nsafeToJson()\nwith Databricks Runtime 13.3 LTS or above. This method is allowed in the standard access mode. 
It provides a subset of all command context information that can be securely shared on the cluster.\ndbutils.notebook.entry_point.getDbutils().notebook().getContext().safeToJson()" +} \ No newline at end of file diff --git a/scraped_kb_articles/pypmml-fail-find-py4j-jar.json b/scraped_kb_articles/pypmml-fail-find-py4j-jar.json new file mode 100644 index 0000000000000000000000000000000000000000..e55529cf8edcb7d6f5d633e2b1853e45512161b3 --- /dev/null +++ b/scraped_kb_articles/pypmml-fail-find-py4j-jar.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/pypmml-fail-find-py4j-jar", + "title": "Título do Artigo Desconhecido", + "content": "Delete\nNote\nThis KB article is for Databricks Runtime 10.4 LTS. If you use Databricks Runtime 11.3 LTS or above, refer to the\nNotebook or workflow fails with “Error : Py4JError: Could not find py4j jar at” error after trying to install PyPMML on a cluster\nKB article instead.\nProblem\nPyPMML is a Python PMML scoring library.\nAfter installing PyPMML in a Databricks cluster, it fails with a\nPy4JError: Could not find py4j jar\nerror.\n%python\r\n\r\nfrom pypmml import Model\r\nmodelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml')\r\n\r\nError : Py4JError: Could not find py4j jar at\nCause\nThis error occurs due to a dependency on the default Py4J library.\nDatabricks Runtime 5.0-6.6 uses Py4J 0.10.7.\nDatabricks Runtime 7.0 and above uses Py4J 0.10.9.\nThe default Py4J library is installed to a different location than a standard Py4J package. 
As a result, when PyPMML attempts to invoke Py4J from the default path, it fails.\nSolution\nSet up a cluster-scoped init script that copies the required Py4J jar file into the expected location.\nUse pip to install the version of Py4J that corresponds to your Databricks Runtime version.\nFor example, in Databricks Runtime 6.5 run\npip install py4j==0.10.7\nin a notebook to install Py4J 0.10.7 on the cluster.\nRun\nfind /databricks/ -name \"py4j*jar\"\nin a notebook to confirm the full path to the Py4J jar file. It is usually located in a path similar to\n/databricks/python3/share/py4j/\n.\nManually copy the Py4J jar file from the install path to the DBFS path\n/dbfs/py4j/\n.\nRun the following code snippet in a Python notebook to create the\ninstall-py4j-jar.sh\ninit script. Make sure the version number of Py4J listed in the snippet corresponds to your Databricks Runtime version.\n%python\r\n\r\ndbutils.fs.put(\"/databricks/init-scripts/install-py4j-jar.sh\", \"\"\"\r\n\r\n#!/bin/bash\r\necho \"Copying at `date`\"\r\nmkdir -p /share/py4j/ /current-release/\r\ncp /dbfs/py4j/py4j.jar /share/py4j/\r\ncp /dbfs/py4j/py4j.jar /current-release/\r\necho \"Copying completed at `date`\"\r\n\r\n\"\"\", True)\nAttach the\ninstall-py4j-jar.sh\ninit script to your cluster, following the instructions in configure a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nRestart the cluster.\nVerify that PyPMML works as expected."
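As an aside, the init script body from the article can be generated programmatically before handing it to `dbutils.fs.put`, which makes the jar path easy to parameterize. This sketch uses only the paths the article mentions; the helper name is hypothetical.

```python
def build_py4j_init_script(jar_source: str = "/dbfs/py4j/py4j.jar") -> str:
    """Render the init script body that copies the Py4J jar into the
    locations listed in the article."""
    lines = [
        "#!/bin/bash",
        'echo "Copying at `date`"',
        "mkdir -p /share/py4j/ /current-release/",
        f"cp {jar_source} /share/py4j/",
        f"cp {jar_source} /current-release/",
        'echo "Copying completed at `date`"',
    ]
    return "\n".join(lines) + "\n"

script = build_py4j_init_script()
print(script)
```

In a notebook you would then write `script` to the init-script path with `dbutils.fs.put(..., script, True)`.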
+} \ No newline at end of file diff --git a/scraped_kb_articles/pyspark-merge-operation-with-withschemaevolution-fails-on-serverless-compute.json b/scraped_kb_articles/pyspark-merge-operation-with-withschemaevolution-fails-on-serverless-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..ef6f92d8a23dae58b67331f33920ea1fd352b56d --- /dev/null +++ b/scraped_kb_articles/pyspark-merge-operation-with-withschemaevolution-fails-on-serverless-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/pyspark-merge-operation-with-withschemaevolution-fails-on-serverless-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen running an Apache Spark PySpark job workflow on a serverless compute in serverless environment version 1, you try to conduct a\nMERGE\noperation with\nwithSchemaEvolution\nusing Delta Lake and receive the following error.\nError message: AttributeError: 'DeltaMergeBuilder' object has no attribute 'withSchemaEvolution'\nThis issue occurs despite using a Databricks Runtime version that supports schema evolution (15.4 LTS and above).\nCause\nCertain Apache Spark configurations, including those required for schema evolution (\nspark.databricks.delta.schema.autoMerge.enabled\n), are not supported in serverless compute version 1 environments. 
As a result, the\nwithSchemaEvolution\nmethod, which relies on these configurations, is also not supported.\nFor more information, refer to the\nServerless compute limitations\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nTo review the Spark configs supported in serverless, refer to the “​​Configure Spark properties for serverless notebooks and jobs” section of the\nSet Spark configuration properties on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor more information about schema evolution, review the “Schema evolution syntax for merge” section of the\nUpdate Delta Lake table schema\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nEither use a job compute or all-purpose compute instead of serverless, use SQL to perform the\nMERGE\noperation with schema evolution, or use serverless environment version 2 or above.\nUse a job compute or all-purpose compute\nInstead of using serverless compute, switch to a job cluster or an all-purpose cluster with Databricks Runtime 15.4 LTS and above where the\nwithSchemaEvolution\nmethod is supported.\nThis involves changing the compute configuration for your Databricks job workflow. Please refer to the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\nUse SQL to perform the MERGE operation with schema evolution\nAlternatively, use SQL to perform the\nMERGE\noperation with schema evolution. 
You can execute the operation directly in a SQL cell or in PySpark using\nspark.sql()\n.\nDirect SQL example\n%sql\r\n   MERGE WITH SCHEMA EVOLUTION INTO t\r\n   USING source s\r\n   ON s.id = t.id\r\n   WHEN MATCHED THEN\r\n   UPDATE SET *\r\n   WHEN NOT MATCHED THEN\r\n   INSERT *\nSQL in PySpark example\nspark.sql(\"\"\"\r\n       MERGE WITH SCHEMA EVOLUTION INTO t\r\n       USING source s\r\n       ON s.id = t.id\r\n       WHEN MATCHED THEN\r\n       UPDATE SET *\r\n       WHEN NOT MATCHED THEN\r\n       INSERT *\r\n   \"\"\")\nUse serverless environment version 2 or above\nFor more information, refer to the\nServerless environment versions\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/pysparkassertionerror-received-incorrect-server-side-session-identifier-for-request.json b/scraped_kb_articles/pysparkassertionerror-received-incorrect-server-side-session-identifier-for-request.json new file mode 100644 index 0000000000000000000000000000000000000000..afaeced17a3643ce14bd066b5d978396942d87bf --- /dev/null +++ b/scraped_kb_articles/pysparkassertionerror-received-incorrect-server-side-session-identifier-for-request.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/pysparkassertionerror-received-incorrect-server-side-session-identifier-for-request", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are running a notebook on serverless when you get a PySpark assertion error message.\nError: PySparkAssertionError: Received incorrect server side session identifier for request. Please create a new Spark Session to reconnect. (52a6f5e0-3410-4f58-8a4e-ca81a4f41dc0 != 13140493-d7a8-4d33-85fd-044e753afef6)\nThe error persists even if you try to start a new Apache Spark session.\nCause\nThis can happen if the target cluster has restarted or crashed. 
It is part of the crash detection features in Spark Connect.\nThe server maintains a session id which it sends to the client in every response. When the client first gets an RPC back from the server, it records the server session id, and throws an error if it ever receives a different server session id.\nIf the server crashes, restarts, or otherwise loses the session state, the client sees a new session id and the error is thrown.\nSolution\nYou must detach and reattach to serverless compute to reset the state.\nTo detach and reattach a notebook to serverless compute in Databricks, follow these steps:\nClick the cluster dropdown menu in the notebook toolbar.\nHover over the attached cluster in the list to display a side menu.\nClick\nDetach & re-attach\nfrom the menu options." +} \ No newline at end of file diff --git a/scraped_kb_articles/pysparkvalueerror-when-working-with-udfs-in-apache-spark.json b/scraped_kb_articles/pysparkvalueerror-when-working-with-udfs-in-apache-spark.json new file mode 100644 index 0000000000000000000000000000000000000000..d855b86a203a7051a254bfd4b95cf066cf30ab02 --- /dev/null +++ b/scraped_kb_articles/pysparkvalueerror-when-working-with-udfs-in-apache-spark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/pysparkvalueerror-when-working-with-udfs-in-apache-spark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with user-defined functions (UDFs) in Apache Spark, you encounter the following error.\npyspark.errors.exceptions.base.PySparkValueError: [UNEXPECTED_TUPLE_WITH_STRUCT] Unexpected tuple {} with StructType.\nExample of problem-creating code\nThe following code demonstrates a UDF where the returned object value is not\nStructType\n, which leads to the error.\n%python\r\n\r\nfrom pyspark.sql.functions import udf\r\nfrom pyspark.sql.types import StructType\r\n\r\n# Define a UDF, which returns your object value instead of StructType.\r\ndef faulty_udf(value):\r\n    return  {}   
\r\n\r\n# Register the UDF with StructType, which does not match the object value output above.\r\nfaulty_udf_spark = udf(faulty_udf, StructType())\r\n\r\ndata = [(1,), (2,), (3,)]\r\ndf = spark.createDataFrame(data, [\"input\"])\r\ndf_with_faulty_udf = df.withColumn(\"output\", faulty_udf_spark(df[\"input\"]))\r\ndf_with_faulty_udf.show()\nCause\nThe value your UDF returns at runtime does not match the StructType schema the UDF was registered with.\nSolution\nEnsure your UDF’s runtime output values use a schema that matches the schema defined in your source code. Any value returned by the UDF that does not match the declared schema will throw this error." +} \ No newline at end of file diff --git a/scraped_kb_articles/pystan-fails-dbr64es.json b/scraped_kb_articles/pystan-fails-dbr64es.json new file mode 100644 index 0000000000000000000000000000000000000000..0616916c71c37476baccacb753db367e03911cdc --- /dev/null +++ b/scraped_kb_articles/pystan-fails-dbr64es.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/pystan-fails-dbr64es", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to install the\nPyStan\nPyPi package on a Databricks Runtime 6.4 Extended Support cluster and get a\nManagedLibraryInstallFailed\nerror message.\njava.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pystan, --disable-pip-version-check) exited with code 1.
Could not find a version that satisfies the requirement httpstan<4.5,>=4.4 (from pystan) (from versions: 0.1.0, 0.1.1, 0.2.3, 0.2.5, 0.3.0, 0.3.1, 0.4.0, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.5, 0.7.6, 0.8.0, 0.9.0, 0.10.1, 1.0.0)\r\nNo matching distribution found for httpstan<4.5,>=4.4 (from pystan)\r\n for library:PythonPyPiPkgId(pystan,None,None,List()),isSharedLibrary=false\nCause\nWhen you install PyStan via PyPi, it attempts to install the latest version.\nPyStan 3.0.0 and above are not compatible with Databricks Runtime 6.4 Extended Support.\nSolution\nYou should use pystan version 2.19.1.1 on Databricks Runtime 6.4 Extended Support.\nSpecify\npystan==2.19.1.1\nwhen you install the library on your cluster (\nAWS\n|\nAzure\n). This is the most recent version that is compatible with Databricks Runtime 6.4 Extended Support.\nIf you require pystan version 3.0.0 or above, you should upgrade to Databricks Runtime 7.3 LTS or above." +} \ No newline at end of file diff --git a/scraped_kb_articles/python-2-eol.json b/scraped_kb_articles/python-2-eol.json new file mode 100644 index 0000000000000000000000000000000000000000..55a4785cd904f7cbc25ed9c35a3d1bf548c3ecb9 --- /dev/null +++ b/scraped_kb_articles/python-2-eol.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-2-eol", + "title": "Título do Artigo Desconhecido", + "content": "Python.org officially moved\nPython 2 into EoL (end-of-life) status\non January 1, 2020.\nWhat does this mean for you?\nDatabricks Runtime 6.0 and above\nDatabricks Runtime 6.0 and above support only Python 3. You cannot create a cluster with Python 2 using these runtimes. Any clusters created with these runtimes use Python 3 by definition.\nDatabricks Runtime 5.5 LTS\nWhen you create a Databricks Runtime 5.5 LTS cluster by using the workspace UI, the default is Python 3. You have the option to specify Python 2. 
If you use the Databricks REST API (\nAWS\n|\nAzure\n) to create a cluster using Databricks Runtime 5.5 LTS, the default is Python 2. If you have a Databricks Runtime 5.5 LTS cluster running Python 2, you are not required to upgrade to Python 3.\nYou can use the following call to specify Python 3 when you create a cluster using the Databricks REST API.\n\"spark_env_vars\": {\r\n  \"PYSPARK_PYTHON\": \"/databricks/python3/bin/python3\"\r\n},\nShould I upgrade to Python 3?\nThe decision to upgrade depends on your specific circumstances, including reliance on other systems and dependencies. This is a decision that should be made in conjunction with your engineering organization.\nThe official Python.org statement is as follows:\nAs of January 1st, 2020 no new bug reports, fixes, or changes will be made to Python 2, and Python 2 is no longer supported. We have not yet released the few changes made between when we released Python 2.7.17 (on October 19th, 2019) and January 1st. As a service to the community, we will bundle those fixes (and only those fixes) and release a 2.7.18. We plan on doing that in April 2020, because that’s convenient for the release managers, not because it implies anything about when support ends.\nSupport\nDatabricks does not offer official support for discontinued third-party software.\nSupport requests related to Python 2 are not eligible for engineering support." 
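The `spark_env_vars` snippet above sits inside a full cluster-create request. As an illustration, a minimal Clusters API create payload for Databricks Runtime 5.5 LTS that forces Python 3 could look like the following sketch; the cluster name, worker count, and `spark_version` string are illustrative assumptions.

```python
import json

# Sketch of a Clusters API create payload that forces Python 3 on a
# Databricks Runtime 5.5 LTS cluster via PYSPARK_PYTHON.
cluster_spec = {
    "cluster_name": "dbr-5.5-python3",       # hypothetical name
    "spark_version": "5.5.x-scala2.11",      # assumed runtime identifier
    "num_workers": 2,
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
}
print(json.dumps(cluster_spec, indent=2))
```

The `spark_env_vars` block is the only part taken verbatim from the article; everything else is scaffolding around it.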
+} \ No newline at end of file diff --git a/scraped_kb_articles/python-cmd-fail-conda-cluster.json b/scraped_kb_articles/python-cmd-fail-conda-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..955163e276085d62ed42c77faf30aed37d9d36ab --- /dev/null +++ b/scraped_kb_articles/python-cmd-fail-conda-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/python-cmd-fail-conda-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using a Databricks Runtime for Machine Learning cluster and Python notebooks are failing.\nYou find an\ninvalid syntax error\nin the logs.\nSyntaxError: invalid syntax\r\n  File \"/local_disk0/tmp/1593092990800-0/PythonShell.py\", line 363\r\n    def __init__(self, *args, condaMagicHandler=None, **kwargs):\nCause\nKey values in the\n/etc/environment/\nfile are being overwritten by user environment variables.\nThere are several default environment variables that should not be overwritten.\nFor example,\nMLFLOW_CONDA_HOME=/databricks/conda\nis set by default. 
If you overwrite this value, it can result in the\ninvalid syntax\nerror.\nThis sample init script can cause the issue, because it replaces, rather than appends, values.\n%python\r\n\r\ndbutils.fs.put(\"/databricks/init-scripts/set-env.sh\", \"\"\"#!/bin/bash\r\nsudo echo VAR1=\"VAL1\" > /etc/environment\r\nsudo echo VAR2=\"VAL2\" > /etc/environment\r\nsudo echo VAR3=\"VAL3\" > /etc/environment\r\n\"\"\", true)\nSolution\nYou should not overwrite any values in the\n/etc/environment/\nfile.\nYou should always append variables to the\n/etc/environment/\nfile.\nThis sample init script avoids the issue by appending every value to the\n/etc/environment/\nfile.\n%python\r\n\r\ndbutils.fs.put(\"/databricks/init-scripts/set-env.sh\", \"\"\"#!/bin/bash\r\nsudo echo VAR1=\"VAL1\" >> /etc/environment\r\nsudo echo VAR2=\"VAL2\" >> /etc/environment\r\nsudo echo VAR3=\"VAL3\" >> /etc/environment\r\n\"\"\", true)" +} \ No newline at end of file diff --git a/scraped_kb_articles/python-cmd-fail-high-con-cluster.json b/scraped_kb_articles/python-cmd-fail-high-con-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..84d89765be0b8392c3908c456e85f0d39763d07f --- /dev/null +++ b/scraped_kb_articles/python-cmd-fail-high-con-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-cmd-fail-high-con-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to run Python commands on a high concurrency cluster.\nAll Python commands fail with a\nWARN\nerror message.\nWARN PythonDriverWrapper: Failed to start repl ReplId-61bef-9fc33-1f8f6-2\r\nExitCodeException exitCode=1: chown: invalid user: ‘spark-9fcdf4d2-045d-4f3b-9293-0f’\nCause\nBoth\nspark.databricks.pyspark.enableProcessIsolation true\nand\nspark.databricks.session.share true\nare set in the Apache Spark configuration on the cluster.\nThese two Spark properties conflict with each other and prevent the cluster from running Python
commands.\nSolution\nYou can only have one of these two Spark properties enabled on your cluster at a time.\nYou must choose process isolation or a Spark shared session based on your needs. Disable the other option." +} \ No newline at end of file diff --git a/scraped_kb_articles/python-cmd-fails-tornado-version.json b/scraped_kb_articles/python-cmd-fails-tornado-version.json new file mode 100644 index 0000000000000000000000000000000000000000..672b3d7f2d4632f762864604987154423f80feb6 --- /dev/null +++ b/scraped_kb_articles/python-cmd-fails-tornado-version.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-cmd-fails-tornado-version", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe cluster returns\nCancelled\nin a Python notebook. Inspect the driver log (\nstd.err\n) in the Cluster Configuration page for a stack trace and error message similar to the following:\nlog4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$).\r\nlog4j:WARN Please initialize the log4j system properly.\r\nlog4j:WARN See https://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.\r\nOpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0\r\nTraceback (most recent call last):\r\n  File \"/local_disk0/tmp/1551693540856-0/PythonShell.py\", line 30, in \r\n    from IPython.nbconvert.filters.ansi import ansi2html\r\n  File \"/databricks/python/lib/python3.5/site-packages/IPython/nbconvert/__init__.py\", line 6, in \r\n    from . 
import postprocessors\r\n  File \"/databricks/python/lib/python3.5/site-packages/IPython/nbconvert/postprocessors/__init__.py\", line 6, in \r\n    from .serve import ServePostProcessor\r\n  File \"/databricks/python/lib/python3.5/site-packages/IPython/nbconvert/postprocessors/serve.py\", line 29, in \r\n    class ProxyHandler(web.RequestHandler):\r\n  File \"/databricks/python/lib/python3.5/site-packages/IPython/nbconvert/postprocessors/serve.py\", line 31, in ProxyHandler\r\n    @web.asynchronous\r\nAttributeError: module 'tornado.web' has no attribute 'asynchronous'\nCause\nWhen you install the\nbokeh\nlibrary, by default\ntornado\nversion 6.0a1 is installed, which is an alpha release. The alpha release causes this error, so the solution is to revert to the stable version of\ntornado\n.\nSolution\nFollow the steps below to create a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n). The init script removes the newer version of\ntornado\nand installs the stable version.\nIf the init script does not already exist, create a base directory to store it:\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the following script:\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//tornado.sh\",\"\"\"\r\n#!/bin/bash\r\npip uninstall --yes tornado\r\nrm -rf /home/ubuntu/databricks/python/lib/python3.5/site-packages/tornado*\r\nrm -rf /databricks/python/lib/python3.5/site-packages/tornado*\r\n/usr/bin/yes | /home/ubuntu/databricks/python/bin/pip install tornado==5.1.1\r\n\"\"\",True)\nConfirm that the script exists:\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//tornado.sh\"))\nGo to the cluster configuration page (\nAWS\n|\nAzure\n|\nGCP\n) and click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab:\nIn the\nDestination\ndrop-down, select\nDBFS\n, provide the file path to the script, and click\nAdd\n.\nRestart the cluster.\nFor more information, see:\nCI failures with tornado 6.0a1\nConvert proxy handler from callback 
to coroutine" +} \ No newline at end of file diff --git a/scraped_kb_articles/python-command-cancelled.json b/scraped_kb_articles/python-command-cancelled.json new file mode 100644 index 0000000000000000000000000000000000000000..8fff238a1f31e2a5d88f7731c61f5d3f98af7a13 --- /dev/null +++ b/scraped_kb_articles/python-command-cancelled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-command-cancelled", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe cluster returns\nCancelled\nin a Python notebook. Notebooks in all other languages execute successfully on the same cluster.\nCause\nWhen you install a conflicting version of a library, such as\nipython\n,\nipywidgets\n,\nnumpy\n,\nscipy\n, or\npandas\nto the\nPYTHONPATH\n, then the Python REPL can break, causing all commands to return\nCancelled\nafter 30 seconds. This also breaks %sh, the notebook macro that lets you enter shell scripts in Python notebook cells.\nInfo\nIf you’ve recently installed a bokeh library on the cluster, the installation may have included an incompatible tornado library. See\nCluster cancels Python command execution after installing Bokeh.\nIf you’ve installed a numpy library, it may be incompatible. 
See\nPython command execution fails with AttributeError\n.\nSolution\nTo solve this problem, do the following:\nIdentify the conflicting library and uninstall it.\nInstall the correct version of the library in a notebook or with a cluster-scoped init script.\nIdentify the conflicting library\nUninstall each library one at a time, and check if the Python REPL still breaks.\nIf the REPL still breaks, reinstall the library you removed and remove the next one.\nWhen you find the library that causes the REPL to break, install the correct version of that library using one of the two methods below.\nYou can also inspect the driver log (\nstd.err\n) for the cluster (on the Cluster Configuration page) for a stack trace and error message that can help identify the library conflict.\nInstall the correct library\nDo one of the following.\nOption 1: Install in a notebook using pip3\n%sh \r\n\r\nsudo apt-get -y install python3-pip\r\n  pip3 install \nOption 2: Install using a cluster-scoped init script\nFollow the steps below to create a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n) that installs the correct version of the library. Replace\n\nin the examples with the filename of the library to install.\nIf the init script does not already exist, create a base directory to store it:\n%sh\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the following script:\n%sh\r\n\r\ndbutils.fs.put(\"/databricks/init/cluster-name/.sh\",\"\"\"\r\n #!/bin/bash\r\n sudo apt-get -y install python3-pip\r\n sudo pip3 install \r\n \"\"\", True)\nConfirm that the script exists:\n%sh\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//.sh\"))\nGo to the cluster configuration page (\nAWS\n|\nAzure\n|\nGCP\n) and click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab:\nIn the\nDestination\ndrop-down, select\nDBFS\n, provide the file path to the script, and click\nAdd\n.\nRestart the cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/python-exec-display-cancelled.json b/scraped_kb_articles/python-exec-display-cancelled.json new file mode 100644 index 0000000000000000000000000000000000000000..e69f36777ea0006c23aff305696d1a97dee40b65 --- /dev/null +++ b/scraped_kb_articles/python-exec-display-cancelled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-exec-display-cancelled", + "title": "Título do Artigo Desconhecido", + "content": "This article can help you resolve scenarios in which Python command execution fails with an\nAttributeError\n.\nProblem: 'tuple' object has no attribute 'type'\nWhen you run a notebook, Python command execution fails with the following error and stack trace:\nAttributeError: 'tuple' object has no attribute 'type'\nTraceback (most recent call last):\r\nFile \"/local_disk0/tmp/1547561952809-0/PythonShell.py\", line 23, in \r\n  import matplotlib as mpl\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/matplotlib/__init__.py\", line 122, in \r\n  from matplotlib.cbook import is_string_like, mplDeprecation, dedent, get_label\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/matplotlib/cbook.py\", line 33, in \r\n  import numpy as np\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/numpy/__init__.py\", line 142, in \r\n  from . import core\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/numpy/core/__init__.py\", line 57, in \r\n  from . 
import numerictypes as nt\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/numpy/core/numerictypes.py\", line 111, in \r\n  from ._type_aliases import (\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/numpy/core/_type_aliases.py\", line 63, in \r\n  _concrete_types = {v.type for k, v in _concrete_typeinfo.items()}\r\nFile \"/databricks/python/local/lib/python2.7/site-packages/numpy/core/_type_aliases.py\", line 63, in \r\n  _concrete_types = {v.type for k, v in _concrete_typeinfo.items()}\r\nAttributeError: 'tuple' object has no attribute 'type'\r\n\r\n\r\n19/01/15 11:29:26 WARN PythonDriverWrapper: setupRepl:ReplId-7d8d1-8cc01-2d329-9: at the end, the status is\r\nError(ReplId-7d8d1-8cc01-2d329-,com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: Python shell failed to start in 30 seconds)\nCause\nA newer version of\nnumpy\n(1.16.1), which is installed by default by some PyPI clients, is incompatible with other libraries.\nSolution\nFollow the steps below to create a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n) that removes the current version and installs version 1.15.0 of numpy.\nIf the init script does not already exist, create a base directory to store it:\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the following script:\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//numpy.sh\",\"\"\"\r\n#!/bin/bash\r\npip uninstall --yes numpy\r\nrm -rf /home/ubuntu/databricks/python/lib/python3.5/site-packages/numpy*\r\nrm -rf /databricks/python/lib/python3.5/site-packages/numpy*\r\n/usr/bin/yes | /home/ubuntu/databricks/python/bin/pip install numpy==1.15.0\r\n\"\"\",True)\nConfirm that the script exists:\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//numpy.sh\"))\nGo to the cluster configuration page (\nAWS\n|\nAzure\n|\nGCP\n) and click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab:\nIn the\nDestination\ndrop-down, select\nDBFS\n, 
provide the file path to the script, and click\nAdd\n.\nRestart the cluster.\nIn your PyPI client, pin the\nnumpy\ninstallation to version 1.15.1, the latest working version.\nProblem: module 'lib' has no attribute 'SSL_ST_INIT'\nWhen you run a notebook, library installation fails and all Python commands executed on the notebook are cancelled with the following error and stack trace:\nAttributeError: module 'lib' has no attribute 'SSL_ST_INIT'\nTraceback (most recent call last): File \"/databricks/python3/bin/pip\", line 7, in \r\n from pip._internal import main\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/__init__.py\", line 40, in \r\n from pip._internal.cli.autocompletion import autocomplete\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/cli/autocompletion.py\", line 8, in \r\n from pip._internal.cli.main_parser import create_main_parser\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/cli/main_parser.py\", line 12, in \r\n from pip._internal.commands import (\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/commands/__init__.py\", line 6, in \r\n from pip._internal.commands.completion import CompletionCommand\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/commands/completion.py\", line 6, in \r\n from pip._internal.cli.base_command import Command\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/cli/base_command.py\", line 20, in \r\n from pip._internal.download import PipSession\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_internal/download.py\", line 15, in \r\n from pip._vendor import requests, six, urllib3\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_vendor/requests/__init__.py\", line 97, in \r\n from pip._vendor.urllib3.contrib import pyopenssl\r\nFile \"/databricks/python3/lib/python3.5/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py\", line 46, in \r\n import 
OpenSSL.SSL\r\nFile \"/databricks/python3/lib/python3.5/site-packages/OpenSSL/__init__.py\", line 8, in \r\n from OpenSSL import rand, crypto, SSL\r\nFile \"/databricks/python3/lib/python3.5/site-packages/OpenSSL/SSL.py\", line 124, in \r\n SSL_ST_INIT = _lib.SSL_ST_INIT AttributeError: module 'lib' has no attribute 'SSL_ST_INIT'\nCause\nA newer version of the\ncryptography\npackage (in this case, 2.7) was installed by default along with another PyPI library, and this\ncryptography\nversion is incompatible with the version of\npyOpenSSL\nincluded in Databricks Runtimes.\nSolution\nTo resolve and prevent this issue, upgrade\npyOpenSSL\nto the most recent version before you install any library. Use a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n) to install the most recent version of\npyOpenSSL\n:\nIf the init script does not already exist, create a base directory to store it:\n%python\r\n\r\ndbutils.fs.mkdirs(\"dbfs:/databricks//\")\nCreate the following script:\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//openssl_fix.sh\",\"\"\"\r\n#!/bin/bash\r\necho \"Removing pyOpenSSL package\"\r\nrm -rf /databricks/python2/lib/python2.7/site-packages/OpenSSL\r\nrm -rf /databricks/python2/lib/python2.7/site-packages/pyOpenSSL-16.0.0-*.egg-info\r\nrm -rf /databricks/python3/lib/python3.5/site-packages/OpenSSL\r\nrm -rf /databricks/python3/lib/python3.5/site-packages/pyOpenSSL-16.0.0*.egg-info\r\n/databricks/python2/bin/pip install pyOpenSSL==19.0.0\r\n/databricks/python3/bin/pip3 install pyOpenSSL==19.0.0\r\n\"\"\", True)\nConfirm that the script exists:\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks//openssl_fix.sh\"))\nGo to the cluster configuration page (\nAWS\n|\nAzure\n|\nGCP\n) and click the\nAdvanced Options\ntoggle.\nAt the bottom of the page, click the\nInit Scripts\ntab:\nIn the\nDestination\ndrop-down, select\nDBFS\n, provide the file path to the script, and click\nAdd\n.\nRestart the cluster." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/python-kernel-is-unresponsive-error-message.json b/scraped_kb_articles/python-kernel-is-unresponsive-error-message.json new file mode 100644 index 0000000000000000000000000000000000000000..ea2961d2cd8bdd88135355ffdd77c1d2561b8fda --- /dev/null +++ b/scraped_kb_articles/python-kernel-is-unresponsive-error-message.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/python-kernel-is-unresponsive-error-message", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour job fails with a\nPython kernel is unresponsive\nerror message.\nFatal error: The Python kernel is unresponsive.\nCause\nIf the cluster runs out of memory, the Python kernel can crash.\nThis usually happens when running memory-intensive operations with relatively small instances or when running multiple notebooks or jobs in parallel on the same cluster.\nSolution\nImplement the following strategies to address the unresponsive Python kernel issue:\nUse job clusters for non-interactive jobs instead of all-purpose clusters. Refrain from running batch jobs on an all-purpose cluster.\nEnsure that your cluster configuration employs the appropriate type and size to effectively manage the anticipated workload. Consider increasing the cluster size by adding more worker nodes or augmenting the memory capacity of existing nodes.\nOptimize the data pipeline to decrease the amount of data processed simultaneously.\nDistribute workloads across multiple clusters if multiple notebooks or jobs are running simultaneously on the same cluster. Regardless of the cluster's size, there is only one Apache Spark driver node, which cannot be distributed within the cluster.\nIf your operations are memory-intensive, verify that sufficient driver memory is available. 
Be cautious when using the following:\nThe\ncollect()\noperator, which transfers a large volume of data to the driver.\nConverting a substantial DataFrame to a pandas DataFrame.\nMonitor the cluster's performance using Ganglia metrics to identify potential issues and optimize resource usage." +} \ No newline at end of file diff --git a/scraped_kb_articles/python-repl-fails-dcs.json b/scraped_kb_articles/python-repl-fails-dcs.json new file mode 100644 index 0000000000000000000000000000000000000000..cc85abc5c5cd1662fd8cc4e08133a998096c9df0 --- /dev/null +++ b/scraped_kb_articles/python-repl-fails-dcs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/python-repl-fails-dcs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs.\n20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7\r\njava.io.IOException: Cannot run program \"/local_disk0/pythonVirtualEnvDirs/virtualEnv-56a5be60-3e71-486f-ac04-08e8f2491032/bin/python\" (in directory \".\"): error=2, No such file or directory\r\n        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)\r\n        at org.apache.spark.util.Utils$.executeCommand(Utils.scala:1367)\r\n        at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1393)\r\n        at org.apache.spark.util.Utils$.executePythonAndGetOutput(Utils.scala:\r\n…\r\n        at java.lang.Thread.run(Thread.java:748)\r\nCaused by: java.io.IOException: error=2, No such file or directory\r\n        at java.lang.UNIXProcess.forkAndExec(Native Method)\r\n        at java.lang.UNIXProcess.(UNIXProcess.java:247)\r\n        at java.lang.ProcessImpl.start(ProcessImpl.java:134)\r\n        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)\r\n        ... 
17 more\nYou can confirm the issue by running the following command in a notebook:\n%sh\r\n\r\nvirtualenv --no-site-packages\nThe result is an error message similar to the following:\nusage: virtualenv [--version] [--with-traceback] [-v | -q] [--discovery {builtin}] [-p py] [--creator {builtin,cpython3-posix,venv}] [--seeder {app-data,pip}] [--no-seed] [--activators comma_separated_list] [--clear]\r\n                  [--system-site-packages] [--symlinks | --copies] [--download | --no-download] [--extra-search-dir d [d ...]] [--pip version] [--setuptools version] [--wheel version] [--no-pip] [--no-setuptools] [--no-wheel]\r\n                  [--clear-app-data] [--symlink-app-data] [--prompt prompt] [-h]\r\n                  dest\r\nvirtualenv: error: the following arguments are required: dest\nThe\nvirtualenv\ncommand does not recognize the\n--no-site-packages\noption.\nVersion\nThe problem affects all current Databricks Runtime versions, except for Databricks Runtime versions that include Conda. It affects\nvirtualenv\nlibrary version 20.0.0 and above.\nCause\nThis issue is caused by using a Python\nvirtualenv\nlibrary version in the Docker container that does not support the\n--no-site-packages\noption.\nDatabricks Runtime requires a\nvirtualenv\nlibrary that supports the\n--no-site-packages option\n. This option was removed in\nvirtualenv\nlibrary version 20.0.0 and above.\nYou can verify your\nvirtualenv\nlibrary version by running the following command in a notebook:\n%sh \r\n\r\nvirtualenv --version\nSolution\nYou can resolve the issue by specifying a compatible version when you install the\nvirtualenv\nlibrary.\nFor example, setting\nvirtualenv==16.0.0\nin the Dockerfile installs\nvirtualenv\nlibrary version 16.0.0. This version of the library supports the required option." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/python-sdk-endpoint-not-found-error-when-trying-to-use-accountclient.json b/scraped_kb_articles/python-sdk-endpoint-not-found-error-when-trying-to-use-accountclient.json new file mode 100644 index 0000000000000000000000000000000000000000..cb6ba5797d21fbf132503dde5f04bd0b597aa487 --- /dev/null +++ b/scraped_kb_articles/python-sdk-endpoint-not-found-error-when-trying-to-use-accountclient.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/python-sdk-endpoint-not-found-error-when-trying-to-use-accountclient", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to use\ndatabricks.sdk.AccountClient\nin a notebook, with default authentication in your workspace, when you get a\nNotFound: Endpoint not found\nerror message.\nExample\nUsing the Databricks SDK account client to list all of your account workspaces generates the following error:\nNotFound: Endpoint not found for /2.0/accounts//workspaces\nCause\nNotebook native authentication is chosen by default when using the SDK inside a Databricks notebook, unless you have specified a different way to authenticate in the account client.\nNotebook native authentication is not supported at the account level. 
It only authenticates at the workspace level.\nSolution\nYou must provide a valid way to authenticate at the account level before you can use\ndatabricks.sdk.AccountClient\nin a notebook.\nYou can use OAuth with a service principal as detailed in the\nAuthorize unattended access to Databricks resources with a service principal using OAuth\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nOnce you have the service principal client ID and secret, you can authenticate the SDK at the account level by calling\nAccountClient\nin a notebook.\nYou need to replace the following values:\n\n-  Your account URL depends on your cloud.\nAWS:\nhttps://accounts.cloud.databricks.com\nAzure:\nhttps://accounts.azuredatabricks.net\nGCP:\nhttps://accounts.gcp.databricks.com\n\n- You can find your account ID (\nAWS\n|\nAzure\n|\nGCP\n) in the account console.\n\nand\n\n- your service principal ID and secret from when you created the service principal.\na = AccountClient(\r\nhost=\"\",\r\naccount_id='',\r\nclient_id='',\r\nclient_secret=''\r\n)\nFor more information, review the\nDatabricks SDK for Python authentication\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/queries-to-access-files-on-a-unity-catalog-volume-fail-with-filenotfound-error.json b/scraped_kb_articles/queries-to-access-files-on-a-unity-catalog-volume-fail-with-filenotfound-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1251fea7bbfd81687ea672ebdc96be42777e51af --- /dev/null +++ b/scraped_kb_articles/queries-to-access-files-on-a-unity-catalog-volume-fail-with-filenotfound-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/queries-to-access-files-on-a-unity-catalog-volume-fail-with-filenotfound-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile using Databricks apps, you notice you cannot run queries to access files on a Unity Catalog volume. 
You receive the following error.\nFileNotFoundError: [Errno 2] No such file or directory: '/Volumes///'\nYou also notice your queries are successful on general compute.\nCause\nYou are attempting to access the volume directly. Databricks apps do not support direct access to Unity Catalog volumes without using the SDK.\nSolution\nDownload the file from the Unity Catalog volume using the Databricks SDK and then read the file.\nFirst, install the Databricks SDK if it is not already installed.\n%sh\r\npip install databricks-sdk\nNext, use the following code to download the file and read its contents.\n%python\r\nfrom io import BytesIO\r\nimport pandas as pd\r\nfrom databricks.sdk import WorkspaceClient\r\n# Initialize the Databricks SDK client\r\nclient = WorkspaceClient(token='')\r\n# Download the file from the Unity Catalog volume\r\nresponse = client.files.download(\"/Volumes/\")\r\n# Read the file contents\r\nfile_content = BytesIO(response.contents.read())\r\ndf = pd.read_csv(file_content)\nFor more information, refer to the\nDevelop Databricks Apps\n(\nAWS\n|\nAzure\n) documentation.\nAlternatively, you can use the following code to authenticate the service principal using\nWorkspaceClient()\nand then download and read the file contents.\nfrom databricks.sdk import WorkspaceClient\r\nw = WorkspaceClient()\r\nvolume_contents = w.files.list_directory_contents(\"/Volumes/////\")\r\n# Obtain files from UC volume into local app memory \r\ndownloaded_files_contents = []\r\nfor file in volume_contents:\r\n  if not file.is_directory:\r\n    download_file = w.files.download(file.path)\r\n    downloaded_files_contents.append(download_file.contents.read())\nFor more information about authentication and\nw.files\n, refer to the\nw.files\n: Files\nand\nAuthentication\nSDK documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/query-not-skip-header-ext-table.json b/scraped_kb_articles/query-not-skip-header-ext-table.json new file mode 100644 index 0000000000000000000000000000000000000000..3e26fc3f9abda68b4edb2a6ddcee602d40a75dad --- /dev/null +++ b/scraped_kb_articles/query-not-skip-header-ext-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/query-not-skip-header-ext-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are attempting to query an external Hive table, but it keeps failing to skip the header row, even though\nTBLPROPERTIES ('skip.header.line.count'='1')\nis set in the HiveContext.\nYou can reproduce the issue by creating a table with this sample code.\n%sql\r\n\r\nCREATE EXTERNAL TABLE school_test_score (\r\n  `school` varchar(254),\r\n  `student_id` varchar(254),\r\n  `gender` varchar(254),\r\n  `pretest` varchar(254),\r\n  `posttest` varchar(254))\r\nROW FORMAT DELIMITED\r\n  FIELDS TERMINATED BY ','\r\n  LINES TERMINATED BY '\\n'\r\nSTORED AS INPUTFORMAT\r\n  'org.apache.hadoop.mapred.TextInputFormat'\r\nOUTPUTFORMAT\r\n  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\r\nLOCATION\r\n  'dbfs:/FileStore/table_header/'\r\nTBLPROPERTIES (\r\n   'skip.header.line.count'='1'\r\n)\nIf you try to select the first five rows from the table, the first row is the header row.\n%sql\r\n\r\nSELECT * FROM school_test_score LIMIT 5\nCause\nIf you query directly from Hive, the header row is correctly skipped. 
Apache Spark does not recognize the\nskip.header.line.count\nproperty in HiveContext, so it does not skip the header row.\nSpark is behaving as designed.\nSolution\nYou need to use Spark options to create the table with a header option.\n%sql\r\n\r\nCREATE TABLE student_test_score (school String, student_id String, gender String, pretest String, posttest String) USING CSV\r\nOPTIONS (path \"dbfs:/FileStore/table_header/\",\r\n        delimiter \",\",\r\n        header \"true\")\r\n        ;\nSelect the first five rows from the table and the header row is not included.\n%sql\r\n\r\nSELECT * FROM school_test_score LIMIT 5" +} \ No newline at end of file diff --git a/scraped_kb_articles/query-on-classic-or-pro-sql-warehouse-failing-with-metadatafetchfailedexception.json b/scraped_kb_articles/query-on-classic-or-pro-sql-warehouse-failing-with-metadatafetchfailedexception.json new file mode 100644 index 0000000000000000000000000000000000000000..ee196d197e9618b21cc52b4673eec5563ad9cd16 --- /dev/null +++ b/scraped_kb_articles/query-on-classic-or-pro-sql-warehouse-failing-with-metadatafetchfailedexception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/query-on-classic-or-pro-sql-warehouse-failing-with-metadatafetchfailedexception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to run SQL queries in Classic or Pro SQL warehouses, they fail with the following error message.\nMetadataFetchFailedException:\r\norg.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2549 partition 49\nCause\nWhile your query was executing, the spot instance terminated and the worker nodes were lost.\nWhen a worker node is terminated, any shuffle files stored on that node are also lost causing the stage to fail with a shuffle fetch failure.\nSolution\nUpdate your spot instance policy to\nReliability Optimized\n.\nNavigate to the SQL Warehouse where you executed the 
query.\nClick\nEdit\n.\nUnder\nAdvanced options\n, change the\nSpot Instance Policy\nfrom\nCost Optimized\nto\nReliability Optimized\n.\nThis setting ensures that all nodes are launched as on-demand instances, significantly reducing the risk of unexpected termination during query execution.\nNote\nSpot instance policy option selection is not available for serverless warehouses." +} \ No newline at end of file diff --git a/scraped_kb_articles/query-using-copy-into-using-a-direct-file-directory-pattern-fails-with-error-job-aborted-due-to-stage-failure-oom-error.json b/scraped_kb_articles/query-using-copy-into-using-a-direct-file-directory-pattern-fails-with-error-job-aborted-due-to-stage-failure-oom-error.json new file mode 100644 index 0000000000000000000000000000000000000000..34de2bdab6e01aa8902ded55187c3c91fd35404b --- /dev/null +++ b/scraped_kb_articles/query-using-copy-into-using-a-direct-file-directory-pattern-fails-with-error-job-aborted-due-to-stage-failure-oom-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/query-using-copy-into-using-a-direct-file-directory-pattern-fails-with-error-job-aborted-due-to-stage-failure-oom-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you need to load a high volume of small JSON files (for example, a million or more) into a Delta table, you have two query options. Query 1 uses\nINSERT INTO\nwith\nread_files()\nand succeeds.\nQuery 2 uses\nCOPY INTO\nusing a direct file directory pattern and fails with the following error.\nERROR: Job aborted due to stage failure: java.lang.OutOfMemoryError: Java heap space\"\nQuery 1 example\nINSERT INTO ..
(\r\n  SELECT\r\n    ,\r\n    ,\r\n    \r\n  FROM\r\n    read_files(\r\n      \"/Volumes/\",\r\n      schema =>\r\n        \" STRING, STRING, STRING\"\r\n    )\r\n)\nQuery 2 example\nCOPY INTO ..
FROM (\r\n  SELECT\r\n    ::STRING ,\r\n    ::STRING ,\r\n    ::STRING \r\n  FROM '/Volumes/'\r\n) FILEFORMAT = JSON PATTERN = '*.json' FORMAT_OPTIONS ('multiline' = 'true')\nCause\nThe core difference between the two SQL queries is how they handle file listings and metadata management.\nThe\nread_files()\nfunction reads files directly without relying on recursive directory listing at the driver level. It instead constructs the DataFrame using an explicitly provided schema and the specified file paths or prefix, minimizing driver memory operations. In the following image (which shows a test reproduction environment with dummy data), no file and directory listing operation appears.\nThe\nCOPY INTO\ncommand retrieves metadata about all files in the specified source directory/prefix to match the provided pattern,\nPATTERN\n, and applies format-specific options\nFILE FORMAT\nand\nFORMAT_OPTIONS\n.\nThe command uses the metadata service to populate a file index,\nInMemoryFileIndex\n, a process that happens in the driver’s memory. In the following image (which shows a test reproduction environment with dummy data), an additional Apache Spark job is launched to perform a file and directory listing operation. The additional job is highlighted with a red box.\nFor directories containing a large number of files, this additional operation (which runs purely on the driver) can lead to a Java heap out of memory (OOM) error.\nSolution\nYou can use\nINSERT INTO\nwith the\nread_files()\nfunction which succeeds. 
This approach avoids the driver-intensive operation of listing leaf files and directories, which can cause a Java heap OOM error.\nIf you need to use the\nCOPY INTO\ncommand, consider the following best practices to avoid Java heap OOM errors.\nIncrease the driver memory allocation in your cluster configuration.\nSplit the source data into smaller directories with fewer files to reduce the number of files processed during the `InMemoryFileIndex` operation.\nDesign your data layout with partitioning and optimal file sizes to minimize the impact of file listing operations on driver memory." +} \ No newline at end of file diff --git a/scraped_kb_articles/querying-systemaccessaudit-table-not-returning-expected-records.json b/scraped_kb_articles/querying-systemaccessaudit-table-not-returning-expected-records.json new file mode 100644 index 0000000000000000000000000000000000000000..2b5cbee5d02e5f86e2770a822a93fa0436e76166 --- /dev/null +++ b/scraped_kb_articles/querying-systemaccessaudit-table-not-returning-expected-records.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/querying-systemaccessaudit-table-not-returning-expected-records", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/r-commands-fail-on-custom-docker-cluster.json b/scraped_kb_articles/r-commands-fail-on-custom-docker-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..36434a6337f7f9082810f001895547d7923762e0 --- /dev/null +++ b/scraped_kb_articles/r-commands-fail-on-custom-docker-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/r-commands-fail-on-custom-docker-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to run R notebooks on a\ncustom Docker cluster\n(\nAWS\n|\nAzure\n), but they immediately fail.\nWhen you try to execute an R notebook, it returns an error saying the notebook was 
cancelled.\nWhen you review the\nCluster driver and worker logs\n(\nAWS\n|\nAzure\n) you see a\nthere is no package called 'Rserve'\nerror.\nTue Aug 30 16:24:34 UTC 2022 Starting R processing from BASH \r\n  \r\nTue Aug 30 16:24:34 UTC 2022 R script: /local_disk0/tmp/_rServeScript.r6851825576782071270resource.r \r\n  \r\nTue Aug 30 16:24:34 UTC 2022 Port number: 1108 \r\n  \r\nTue Aug 30 16:24:34 UTC 2022 cgroup: None \r\n  \r\n2022-08-30 16:24:34 R process started with pid 1462 \r\n  \r\nError in loadNamespace(x) : there is no package called 'Rserve' \r\n  \r\nCalls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart \r\n  \r\nExecution halted.\nWhen you check for the Python libraries, they are all present.\nWhen you check the R version in a notebook, it returns the version information so you know R is installed.\n%sh\r\n\r\nR --version\nR version 4.2.0 (2022-04-22) -- \"Vigorous Calisthenics\" \r\nCopyright (C) 2022 The R Foundation for Statistical Computing \r\nPlatform: x86_64-pc-linux-gnu (64-bit) \r\n  \r\nR is free software and comes with ABSOLUTELY NO WARRANTY. \r\nYou are welcome to redistribute it under the terms of the \r\nGNU General Public License versions 2 or 3. \r\nFor more information about these matters see \r\nhttps://www.gnu.org/licenses/.\nCause\nDatabricks Runtimes use R version 4.1.3 by default. If you start a standard cluster from the\nCompute\nmenu in the workspace and check the version, it returns R version 4.1.3.\nWhen you build a custom cluster with Docker, it is possible to use a different R version. 
In the example used here, we see that the custom Docker cluster is running R version 4.2.0.\nR version 4.2.0 changed the way\nRenviron.site\nis initialized, which implicitly modifies the behavior of\n--vanilla\n.\nSolution\nIf you want to use R version 4.2.0 on a custom Docker cluster with Databricks Runtime 11.3 and below, you must set the\nDATABRICKS_ENABLE_RPROFILE=true\nenvironment variable\n(\nAWS\n|\nAzure\n) on the cluster.\nIf you want to use R version 4.2.0 on a custom Docker cluster with Databricks Runtime 12.0 and above, you can use\nR session customization\n(\nAWS\n|\nAzure\n) to set\nDATABRICKS_ENABLE_RPROFILE=true\nin the\n.Rprofile\nfile.\nFor more information on installing R, please review the\nInstall RStudio Server Open Source Edition\n(\nAWS\n|\nAzure\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/random-split-behavior.json b/scraped_kb_articles/random-split-behavior.json new file mode 100644 index 0000000000000000000000000000000000000000..3487613d4df360aa9216e4fdc432e900b2c1d3f8 --- /dev/null +++ b/scraped_kb_articles/random-split-behavior.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/random-split-behavior", + "title": "Título do Artigo Desconhecido", + "content": "When using\nrandomSplit\non a DataFrame, you could potentially observe inconsistent behavior. Here is an example:\n%python\r\n\r\ndf = spark.read.format('inconsistent_data_source').load()\r\na,b = df.randomSplit([0.5, 0.5])\r\na.join(broadcast(b), on='id', how='inner').count()\nTypically this query returns\n0\n. 
However, depending on the underlying data source or input DataFrame, in some cases the query could result in more than 0 records.\nThis unexpected behavior is explained by the fact that data distribution across RDD partitions is not idempotent, and could be rearranged or updated during the query execution, thus affecting the output of the\nrandomSplit\nmethod.\nDelete\nInfo\nSpark DataFrames and RDDs preserve partitioning order; this problem only exists when query output depends on the actual data distribution across partitions, for example,\nvalues from files 1, 2 and 3 always appear in partition 1\n.\nThe issue could also be observed when using Delta cache (\nAWS\n|\nAzure\n|\nGCP\n). All solutions listed below are still applicable in this case.\nSolution\nDo one of the following:\nUse explicit Apache Spark RDD caching\n%python\r\n\r\ndf = inputDF.cache()\r\na,b = df.randomSplit([0.5, 0.5])\nRepartition by a column or a set of columns\n%python\r\n\r\ndf = inputDF.repartition(100, 'col1')\r\na,b = df.randomSplit([0.5, 0.5])\nApply an aggregate function\n%python\r\n\r\ndf = inputDF.groupBy('col1').count()\r\na,b = df.randomSplit([0.5, 0.5])\nThese operations persist or shuffle data resulting in the consistent data distribution across partitions in Spark jobs." 
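The inconsistency above can be imitated in plain Python (a hedged simulation, not Spark internals): `randomSplit` effectively tags rows with a seeded RNG in the order the source delivers them, so if the source re-lists rows in a different order when `a` and `b` are evaluated separately, the same row can land in both halves.

```python
import random

def random_split(rows, seed=42):
    """Tag each row left/right using a seeded RNG over the rows' CURRENT order."""
    rng = random.Random(seed)
    left, right = [], []
    for row in rows:
        (left if rng.random() < 0.5 else right).append(row)
    return left, right

data = list(range(10))
a, _ = random_split(data)                   # first evaluation: original order
_, b = random_split(list(reversed(data)))   # re-evaluation: source re-listed
print(sorted(set(a) & set(b)))              # often non-empty: rows in BOTH splits
```

Caching the DataFrame, repartitioning by a column, or aggregating first (the solutions listed above) all pin the row-to-partition assignment, which in this model corresponds to evaluating both halves against the same fixed ordering.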
+} \ No newline at end of file diff --git a/scraped_kb_articles/reactivate-a-user-that-has-been-disabled-with-aad-at-the-account-and-workspace-level.json b/scraped_kb_articles/reactivate-a-user-that-has-been-disabled-with-aad-at-the-account-and-workspace-level.json new file mode 100644 index 0000000000000000000000000000000000000000..56ab49f1222c1ae8326803d7e50a0d38f1c93653 --- /dev/null +++ b/scraped_kb_articles/reactivate-a-user-that-has-been-disabled-with-aad-at-the-account-and-workspace-level.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/reactivate-a-user-that-has-been-disabled-with-aad-at-the-account-and-workspace-level", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/reactivate-inactive-users-at-the-account-level-using-oauth-token-for-non-scim-scenarios.json b/scraped_kb_articles/reactivate-inactive-users-at-the-account-level-using-oauth-token-for-non-scim-scenarios.json new file mode 100644 index 0000000000000000000000000000000000000000..e1ac8cc4c676df191070279cba15c5ab33c24c0b --- /dev/null +++ b/scraped_kb_articles/reactivate-inactive-users-at-the-account-level-using-oauth-token-for-non-scim-scenarios.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/reactivate-inactive-users-at-the-account-level-using-oauth-token-for-non-scim-scenarios", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/read-fail-jdbc-dbr6x.json b/scraped_kb_articles/read-fail-jdbc-dbr6x.json new file mode 100644 index 0000000000000000000000000000000000000000..e00c72b8a9a078a6b36f28bf4d70956f00adf975 --- /dev/null +++ b/scraped_kb_articles/read-fail-jdbc-dbr6x.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/read-fail-jdbc-dbr6x", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAttempting to read external tables via JDBC works fine on 
Databricks Runtime 5.5, but the same table reads fail on Databricks Runtime 6.0 and above.\nYou see an error similar to the following:\ncom.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.\r\nat com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)\r\nat com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)\r\nat com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)\r\nat java.lang.Thread.run(Thread.java:748)\r\n.\r\nCaused by: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.;\r\n\r\nat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)\nCause\nDatabricks Runtime 5.5 and below infers the\nsession_id\nattribute as a\nsmallint\n. Databricks Runtime 6.0 and above infers the\nsession_id\nattribute as an\nint\n.\nThis change to the\nsession_id\nattribute causes queries to fail with a schema issue.\nSolution\nIf you are using external tables that were created in Databricks Runtime 5.5 and below in Databricks Runtime 6.0 and above, you must set the Apache Spark configuration\nspark.sql.legacy.mssqlserver.numericMapping.enabled\nto\ntrue\n. This ensures that Databricks Runtime 6.0 and above infers the\nsession_id\nattribute as a\nsmallint\n.\nOpen the\nClusters\npage.\nSelect a cluster.\nClick\nEdit\n.\nClick\nAdvanced Options\n.\nClick\nSpark\n.\nIn the\nSpark config\nfield, enter\nspark.sql.legacy.mssqlserver.numericMapping.enabled true\n.\nSave the change and start, or restart, the cluster." 
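Collapsed into the entry itself, the UI steps above add this single line to the cluster's Spark config field (cluster restart required):

```
spark.sql.legacy.mssqlserver.numericMapping.enabled true
```

After the restart, reading `spark.conf.get("spark.sql.legacy.mssqlserver.numericMapping.enabled")` from a notebook should return `true`; treat that as a sanity check rather than a documented guarantee.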
+} \ No newline at end of file diff --git a/scraped_kb_articles/reading-a-csv-file-in-dropmalformed-still-includes-malformed-rows-in-the-result.json b/scraped_kb_articles/reading-a-csv-file-in-dropmalformed-still-includes-malformed-rows-in-the-result.json new file mode 100644 index 0000000000000000000000000000000000000000..76f7fe89ec32dd1451fb7884e81cf12d04058131 --- /dev/null +++ b/scraped_kb_articles/reading-a-csv-file-in-dropmalformed-still-includes-malformed-rows-in-the-result.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/reading-a-csv-file-in-dropmalformed-still-includes-malformed-rows-in-the-result", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen reading a CSV file in\nDROPMALFORMED\nmode with the\n.schema\noption specified, functions such as\ndf.count()\nor\ndf.agg(count('*')).display()\nstill include malformed rows in the returned result.\nCause\nThe functions\ndf.count()\nand\ndf.agg(count('*')).display()\neach count the number of line breaks in a file without fully parsing each row according to the schema. For example, if you pass a string in and the schema is expecting an integer,\ndf.count\nstill counts it.\nSolution\nDatabricks recommends caching the DataFrame after reading it, and then calling the\ndf.count()\nfunction. This forces Apache Spark to fully parse the data and apply the\nDROPMALFORMED\nmode correctly." 
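A plain-Python analogy (illustrative only, not Spark's implementation) of why the two counts differ: a count that only looks at line breaks sees every row, while a count that parses each row against the schema drops the malformed one, which is what caching the DataFrame forces `df.count()` to do.

```python
import csv
import io

raw = "id,qty\n1,10\n2,oops\n3,30\n"    # row 2 violates an (id INT, qty INT) schema

# Fast path: count line breaks only -- no parsing, malformed rows still counted.
fast_count = raw.strip().count("\n")     # 3 data rows seen

def parsed_count(text):
    """Parse every row against the schema, dropping failures (DROPMALFORMED)."""
    rows = csv.reader(io.StringIO(text))
    next(rows)                           # skip header
    good = 0
    for row in rows:
        try:
            int(row[0]), int(row[1])     # enforce the declared integer schema
            good += 1
        except ValueError:
            continue                     # malformed row: discarded
    return good

print(fast_count, parsed_count(raw))     # 3 vs 2
```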
+} \ No newline at end of file diff --git a/scraped_kb_articles/reading-a-table-fails-due-to-aad-token-timeout-on-adls-gen2.json b/scraped_kb_articles/reading-a-table-fails-due-to-aad-token-timeout-on-adls-gen2.json new file mode 100644 index 0000000000000000000000000000000000000000..9a5405922dbcadfa5b37ec15d8fbd6503abeaa9b --- /dev/null +++ b/scraped_kb_articles/reading-a-table-fails-due-to-aad-token-timeout-on-adls-gen2.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/reading-a-table-fails-due-to-aad-token-timeout-on-adls-gen2", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAccess to ADLS Gen2 storage can be configured using OAuth 2.0 with an Azure service principal. You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure AD) application service principal for authentication.\nYou are trying to access external tables (tables stored outside of the root storage location) which are stored on ADLS Gen2. Access fails with an\nADLException\nerror and an\nIOException : AADToken\ntimeout error.\nWARN DeltaLog: Failed to parse dbfs:/mnt/
. This may happen if there was an error during read operation, or a file appears to be partial. Sleeping and trying again.\r\ncom.microsoft.azure.datalake.store.ADLException: Error getting info for file
\r\nError fetching access tokenOperation null failed with exception java.io.IOException : AADToken: HTTP connection failed for getting token from AzureAD due to timeout. Client Request Id : Latency(ns) : 180152012\r\nLast encountered exception thrown after 5 tries. [java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException]\r\n[ServerRequestId:null]\r\nCaused by: java.io.IOException: Server returned HTTP response code: 401 for URL:\nhttps://login.microsoftonline.com/<\n;directory-id>/oauth2/token  \r\nat sun.reflect.GeneratedConstructorAccessor118.newInstance(Unknown Source)  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\r\nCaused by: java.io.IOException: Server returned HTTP response code: 401 for URL:\nhttps://login.microsoftonline.com/<\n;directory-id>/oauth2/token\r\n  at sun.reflect.GeneratedConstructorAccessor118.newInstance(Unknown Source)\r\n  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\nCause\nAccess to ADLS Gen2 storage fails if the client secret token associated with the Azure Active Directory (Azure AD) application service principal is expired or invalid.\nSolution\nReview the storage account access setup and verify that the client secret is expired. Create a new client secret token and then remount the ADLS Gen2 storage container using the new secret, or update the client secret token with the new secret in the ADLS Gen2 storage account configuration.\nReview existing storage account secrets\nCheck to see if the existing client secret is expired.\nOpen the\nAzure portal\n.\nClick\nAzure Active Directory\n.\nIn the menu on the left, look under\nManage\nand click\nApp registrations.\nOn the all applications tab, locate the application created for Azure Databricks. 
You can search the app registrations by Display name or by Application (client) ID.\nClick on your application.\nIn the menu on the left, look under\nManage\nand click\nCertificates & secrets\n.\nReview the\nClient secrets\nsection and check the date in the\nExpires\ncolumn.\nCreate a new secret token\nIf the existing client secret is expired, you must create a new token.\nClick\nNew client secret\n.\nEnter a description and a duration for the secret.\nClick\nAdd\n.\nThe client secret is displayed. Copy the\nValue\n. It cannot be retrieved after you leave the page.\nDelete\nWarning\nIf you forget to copy the secret\nValue\n, you must repeat these steps. The\nValue\ncannot be retrieved once you leave the page. Returning to the page displays a masked version of the\nValue\n.\nRemount ADLS Gen2 storage with new secret\nOnce you have generated a new client secret, you can unmount the existing ADLS Gen2 storage, update the secret information, and then remount the storage.\nUnmount the existing mount point.\n%python\r\n\r\ndbutils.fs.unmount(\"/mnt/\")\nReview the\ndbutils.fs.unmount\ndocumentation for more information.\nRemount the storage account with the new client secret.\nReplace\n\nwith the\nApplication (client) ID\nfor the Azure Active Directory application\n\nwith the name of the container\n\nwith the\nDirectory (tenant) ID\nfor the Azure Active Directory application\n\nwith the Databricks secret scope name\n\nwith the name of the key containing the client secret\n\nwith the name of the Azure storage account\n%python\r\n\r\nconfigs = {\"fs.azure.account.auth.type\": \"OAuth\",\r\n \"fs.azure.account.oauth.provider.type\": \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\",\r\n \"fs.azure.account.oauth2.client.id\": \"\",\r\n \"fs.azure.account.oauth2.client.secret\": dbutils.secrets.get(scope=\"\",key=\"\"),\r\n \"fs.azure.account.oauth2.client.endpoint\": \"https://login.microsoftonline.com//oauth2/token\"}\r\n\r\n# Optionally, you can add to the 
source URI of your mount point.\r\ndbutils.fs.mount(\r\n source = \"abfss://@.dfs.core.windows.net/\",\r\n mount_point = \"/mnt/\",\r\n  extra_configs = configs)\nReview the\nmount an Azure Blob storage container\ndocumentation for more information.\nReplace client secret in the storage account config\nAs an alternative to updating individual mounts, you can replace the client secret in the storage account authentication configuration. The storage account must be set up for direct access.\n%python\r\n\r\nspark.conf.set(\"fs.azure.account.auth.type..dfs.core.windows.net\", \"OAuth\")\r\nspark.conf.set(\"fs.azure.account.oauth.provider.type..dfs.core.windows.net\", \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\r\nspark.conf.set(\"fs.azure.account.oauth2.client.id..dfs.core.windows.net\", \"\")\r\nspark.conf.set(\"fs.azure.account.oauth2.client.secret..dfs.core.windows.net\", \"\")\r\nspark.conf.set(\"fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net\", \"https://login.microsoftonline.com//oauth2/token\")\nReplace\n\nwith the\nApplication (client) ID\nfor the Azure Active Directory application\n\nwith the\nDirectory (tenant) ID\nfor the Azure Active Directory application\n\nwith the name of the key containing the client secret\n\nwith the name of the Azure storage account\nReview the\naccess ADLS Gen2\ndocumentation for more information." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/reading-avro-files-with-structured-streaming-using-wildcards-in-the-path-fails-with-error-arrayindexoutofboundsexception.json b/scraped_kb_articles/reading-avro-files-with-structured-streaming-using-wildcards-in-the-path-fails-with-error-arrayindexoutofboundsexception.json new file mode 100644 index 0000000000000000000000000000000000000000..a0ac036d3941d25194d5eb0814a17a357bc6e79d --- /dev/null +++ b/scraped_kb_articles/reading-avro-files-with-structured-streaming-using-wildcards-in-the-path-fails-with-error-arrayindexoutofboundsexception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/reading-avro-files-with-structured-streaming-using-wildcards-in-the-path-fails-with-error-arrayindexoutofboundsexception", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to read Avro files with Structured Streaming using wildcards in the path, the read fails with an error.\njava.lang. ArrayIndexOutOfBoundsException: 0\r\nat org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:178)\nCause\nBy default, Databricks does not perform a recursive file lookup, which means that it will not read files in subdirectories in a specified path.\nSolution\nAdd\n`.option(\"recursiveFileLookup\", \"true\"`)\nto Apache Spark read commands. 
This option enables recursive file lookup, ensuring that Databricks reads files in subdirectories of the specified path.\nExample with Avro files\n```scala\r\nval df = spark\r\n .readStream\r\n .schema(sourceSchema)\r\n .option(\"recursiveFileLookup\", \"true\")\r\n .format(\"avro\")\r\n .load(basePath)\r\ndisplay(df)\r\n```\nExample with Parquet files\n```scala\r\nval df = spark\r\n .readStream\r\n .schema(sourceSchema)\r\n .option(\"recursiveFileLookup\", \"true\")\r\n .parquet(basePath)\r\ndisplay(df)\r\n```" +} \ No newline at end of file diff --git a/scraped_kb_articles/readstream-is-not-whitelisted.json b/scraped_kb_articles/readstream-is-not-whitelisted.json new file mode 100644 index 0000000000000000000000000000000000000000..4711d19a842cc32507d8de95562d570690c5ea85 --- /dev/null +++ b/scraped_kb_articles/readstream-is-not-whitelisted.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/readstream-is-not-whitelisted", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have table access control (\nAWS\n|\nAzure\n|\nGCP\n) enabled on your cluster.\nYou are trying to run a structured streaming query and get an error message.\npy4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext\nCause\nStreaming is not supported on clusters that have table access control enabled.\nAccess control allows you to set permissions for data objects on a cluster. It requires user interaction to validate and refresh credentials.\nBecause streaming queries run continuously, they are not supported on clusters with table access control.\nSolution\nYou should use a cluster that does not have table access control enabled for streaming queries."
+} \ No newline at end of file diff --git a/scraped_kb_articles/receiving-a-cudnn-version-mismatch-error-when-running-tensorflow-within-a-16-3-ml-runtime-environment.json b/scraped_kb_articles/receiving-a-cudnn-version-mismatch-error-when-running-tensorflow-within-a-16-3-ml-runtime-environment.json new file mode 100644 index 0000000000000000000000000000000000000000..871599e308ebe9d19240391f8a5a5c39a94951d3 --- /dev/null +++ b/scraped_kb_articles/receiving-a-cudnn-version-mismatch-error-when-running-tensorflow-within-a-16-3-ml-runtime-environment.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/receiving-a-cudnn-version-mismatch-error-when-running-tensorflow-within-a-16-3-ml-runtime-environment", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you run TensorFlow within a 16.3 ML runtime environment, you receive a CuDNN version mismatch error such as the following. The error prevents proper initialization of the DNN library and results in the failed execution of TensorFlow operations involving GPU acceleration.\nEXXXXXX cuda_dnn.cc:XXX] Loaded runtime CuDNN library: 9.1.0 but source was compiled with: 9.3.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. EXXXXXX cuda_dnn.cc:XXX] Loaded runtime CuDNN library: 9.1.0 but source was compiled with: 9.3.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. : W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at xla_ops.cc:XXX : FAILED_PRECONDITION: DNN library initialization failed. 
Look at the errors above for more details. : I tensorflow/core/framework/local_rendezvous.cc:XXX] Local rendezvous is aborting with status: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details. [[{{node StatefulPartitionedCall}}]]\nCause\nThis is a conflict between TensorFlow and PyTorch related to CuDNN library versions in a shared environment.\nTensorFlow 2.18 requires CuDNN 9.3 but PyTorch 2.4, when installed using pip, bundles its own CuDNN 9.1. When both frameworks are installed together, the PyTorch-linked CuDNN is given preference. TensorFlow then fails because the CuDNN version loaded into memory (9.1) is older than what it was compiled with (9.3).\nSolution\nRun the following command in a notebook to downgrade TensorFlow to a version that is compatible with CuDNN 9.1. TensorFlow 2.17 compiles against CuDNN 8.9, which is compatible with the Databricks runtime-provided libraries.\npip install tensorflow[and-cuda]==2.17.0" +} \ No newline at end of file diff --git a/scraped_kb_articles/receiving-com-databricks-sql-io-filereadexception-error-while-reading-file-on-streaming-queries.json b/scraped_kb_articles/receiving-com-databricks-sql-io-filereadexception-error-while-reading-file-on-streaming-queries.json new file mode 100644 index 0000000000000000000000000000000000000000..3cb06a250ea76f62f424a0cd32ec85eb7c2acfe3 --- /dev/null +++ b/scraped_kb_articles/receiving-com-databricks-sql-io-filereadexception-error-while-reading-file-on-streaming-queries.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/receiving-com-databricks-sql-io-filereadexception-error-while-reading-file-on-streaming-queries", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour streaming query fails when reading from a Delta table with a\ncom.databricks.sql.io.FileReadException: Error while reading file\nerror message.\nExample error\ncom.databricks.sql.io.FileReadException: Error while reading file 
{file_path} \r\nFile {file_path} referenced in the transaction log cannot be found. This occurs when data has been manually deleted\r\nfrom the file system rather than using the table `DELETE` statement. For more information,\r\nsee https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions\nCause\nThis can occur when an Apache Spark task tries to read a source file that no longer exists, or if the streaming query takes longer than the time specified in\ndelta.deletedFileRetentionDuration\n(default value 7 days).\nIt can also happen if the file was manually deleted.\nSolution\nOptimize your streaming query so it completes in less time or increase the\ndelta.deletedFileRetentionDuration\nvalue so it is at least one day longer than the time it takes your query to complete.\nFor more information on the\ndelta.deletedFileRetentionDuration\nproperty, review the\nWork with Delta Lake table history\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/recover-a-dropped-table-when-a-new-table-is-created-with-the-same-name.json b/scraped_kb_articles/recover-a-dropped-table-when-a-new-table-is-created-with-the-same-name.json new file mode 100644 index 0000000000000000000000000000000000000000..7926043b53b79ffde66259d0bd30097aa2fdd476 --- /dev/null +++ b/scraped_kb_articles/recover-a-dropped-table-when-a-new-table-is-created-with-the-same-name.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/recover-a-dropped-table-when-a-new-table-is-created-with-the-same-name", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou dropped and recreated a new Delta table, resulting in the loss of table versions and history. You want to recover the lost data with its history. 
However since there is another table with the same name, the\nUNDROP\nmethod does not work and gives you an error message stating that the table already exists.\nUnityCatalogServiceException: [RequestId=*******-****-****-***-*********** ErrorClass=TABLE_ALREADY_EXISTS. UNDROP_TABLE_ALREADY_EXISTS] Cannot undrop table because a table with the name catalogname.schemaname.tablename already exists.\nCause\nThe\nUNDROP\n(\nAWS\n|\nAzure\n|\nGCP\n) SQL command does not work since a workspace cannot have two tables with the same name.\nSolution\nTo recover the table history (within a 7 day retention period):\nRename the newly created table to a different name to avoid name conflicts.\nALTER TABLE .. RENAME TO ..;\nUse the\nUNDROP TABLE\ncommand to recover the original table.\nUNDROP TABLE ..;\nUse\nDESCRIBE HISTORY\nto check the history of the recovered table.\nDESCRIBE HISTORY ..;\nIf the new table is already in use, copy the latest data from the new table to the recovered table.\nINSERT INTO .. SELECT * FROM ..;\nNote\nIf there are multiple dropped relations of the same name, you can use\nSHOW TABLES DROPPED\nto identify the table ID and use\nUNDROP TABLE WITH ID\nto recover a specific relation." +} \ No newline at end of file diff --git a/scraped_kb_articles/recreate-listagg-functionality-with-spark-sql.json b/scraped_kb_articles/recreate-listagg-functionality-with-spark-sql.json new file mode 100644 index 0000000000000000000000000000000000000000..0d29ee926330503aba719494a1475c24055d16ce --- /dev/null +++ b/scraped_kb_articles/recreate-listagg-functionality-with-spark-sql.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/recreate-listagg-functionality-with-spark-sql", + "title": "Título do Artigo Desconhecido", + "content": "LISTAGG\nis a function that aggregates a set of string elements into one string by concatenating the strings. 
An optional separator string can be provided which is inserted between contiguous input strings.\nLISTAGG(, ) WITHIN GROUP(ORDER BY …)\nLISTAGG\nis supported in many databases and data warehouses. However, it is not natively supported in Apache Spark SQL or Databricks.\nYou can obtain similar functionality in Databricks by using\ncollect_list\n(\nAWS\n|\nAzure\n|\nGCP\n) with\nconcat_ws\n(\nAWS\n|\nAzure\n|\nGCP\n).\nInstructions\nBefore using the commands, we first need to create sample data to process.\nThis sample code creates a table (named\ntable1\n) that contains five lines of sample employee data.\n%python\r\n\r\ndata = [[\"James\",\"A\",\"Smith\",\"2018\",\"M\",3000],\r\n [\"Michael\",\"Rose\",\"Jones\",\"2010\",\"M\",4000],\r\n [\"Robert\",\"K\",\"Williams\",\"2010\",\"M\",5000],\r\n [\"Maria\",\"Anne\",\"Jones\",\"2005\",\"F\",4000],\r\n [\"Jen\",\"Mary\",\"Brown\",\"2010\",\"F\",6000]\r\n ]\r\n\r\ndf = spark.createDataFrame(data, [\"fname\",\"mname\",\"lname\",\"dob_year\",\"gender\",\"salary\"])\r\ndf.createOrReplaceTempView(\"table1\")\nThe sample table looks like this when viewed:\nIf you wanted to use\nLISTAGG\nto display a list of salaries by gender, you would use a query like this:\n%sql\r\n\r\nSELECT gender, LISTAGG(salary, ',') WITHIN GROUP(ORDER BY salary)\r\nFROM table1\r\nGROUP BY gender\nThe resulting table has two rows, with salary values separated by gender.\nTo replicate this functionality in Databricks, you need to use\ncollect_list\nand\nconcat_ws\n.\ncollect_list\ncreates a list of objects for the aggregated column. 
In this example, it gets the list of salary values for the aggregated gender column.\nconcat_ws\nconverts a list of salary objects to a single string value containing comma separated salaries.\nThis Spark SQL query returns the same result that you would get with\nLISTAGG\non a different database.\n%sql\r\n\r\nSELECT gender,CONCAT_WS(',', COLLECT_LIST(salary)) as concatenated_salary\r\nFROM table1\r\nGROUP BY gender;\nThe resulting table has two rows, with salary values separated by gender.\nIf you wanted to use\nLISTAGG\nto display the salary results in a descending order, you might write a query like this:\n%sql\r\n\r\nSELECT gender, LISTAGG(salary, ',') WITHIN GROUP(ORDER BY salary DESC)\r\nFROM table1\r\nGROUP BY gender\nTo do the same in Databricks, you would add\nsort_array\nto the previous Spark SQL example.\ncollect_list\nand\nconcat_ws\ndo the job of\nLISTAGG\n, while\nsort_array\nis used to output the salary results in a descending order.\n%sql\r\n\r\nSELECT gender,CONCAT_WS(',', SORT_ARRAY(COLLECT_LIST(salary), false)) as concatenated_salary\r\nFROM table1\r\nGROUP BY gender;\nBoth sets of sample code return the same output, with salary values separated by gender and displayed in descending order." 
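As a quick sanity check of what the Spark SQL above produces, the same aggregation can be modeled in a few lines of plain Python (`COLLECT_LIST` = gather per group, `SORT_ARRAY(..., false)` = sort descending, `CONCAT_WS` = join with a comma), using the article's sample rows:

```python
from collections import defaultdict

rows = [("James", "M", 3000), ("Michael", "M", 4000), ("Robert", "M", 5000),
        ("Maria", "F", 4000), ("Jen", "F", 6000)]

salaries_by_gender = defaultdict(list)
for _name, gender, salary in rows:
    salaries_by_gender[gender].append(salary)        # COLLECT_LIST per gender

concatenated = {                                     # SORT_ARRAY(desc) + CONCAT_WS(',')
    gender: ",".join(str(s) for s in sorted(vals, reverse=True))
    for gender, vals in salaries_by_gender.items()
}
print(concatenated)   # {'M': '5000,4000,3000', 'F': '6000,4000'}
```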
+} \ No newline at end of file diff --git a/scraped_kb_articles/recurring-apache-spark-jobs-with-same-data-set-size-and-cluster-configuration-vary-in-duration.json b/scraped_kb_articles/recurring-apache-spark-jobs-with-same-data-set-size-and-cluster-configuration-vary-in-duration.json new file mode 100644 index 0000000000000000000000000000000000000000..f5f49f20d70a997fdd8434afaa1a4f34b1ff7d7a --- /dev/null +++ b/scraped_kb_articles/recurring-apache-spark-jobs-with-same-data-set-size-and-cluster-configuration-vary-in-duration.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/recurring-apache-spark-jobs-with-same-data-set-size-and-cluster-configuration-vary-in-duration", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nRegularly recurring Apache Spark jobs vary in duration despite using the same cluster configuration and approximately the same data set size. You notice run to run duration can be twice as long or more. Your execution plan is the same for every run and your cluster metrics don’t indicate any performance issue.\nCause\nYour initial cluster type choice initially provides a local solid state drive (SSD) disk. However, if the disk expands, the cluster sometimes uses a hard disk drive (HDD) instead of an SSD.  Because an HDD disk is slower than an SSD, local disk data processing is slower, which increases job duration. The cluster metrics don’t report the local disk throughput speed.\nSolution\nNavigate to the job’s cluster details link to review the event log and check for a disk expansion report for Spark jobs with unexpected longer durations.\nCompare the event log to normal duration job logs.  If you see disk expansion only for the longer job durations, the expansion disks are of type HDD.\nAfter confirming disk expansion only occurs in the jobs with longer durations:\nEnsure that the cluster type used has sufficient SSD memory, avoiding the need for disk expansion. 
Select a cluster configuration that provides ample SSD storage.\nRegularly monitor your clusters’ disk usage to ensure that they are not expanding to HDDs. You can use the Databricks UI or use your cloud provider’s tools for tracking disk types and usage.\nWhere possible, optimize your data storage to reduce the need for disk expansion. This can involve compressing data, partitioning data more effectively, using\nOPTIMIZE\nregularly on Delta tables, or using a more compact compression type." +} \ No newline at end of file diff --git a/scraped_kb_articles/recursive-references-in-avro-schema-are-not-allowed.json b/scraped_kb_articles/recursive-references-in-avro-schema-are-not-allowed.json new file mode 100644 index 0000000000000000000000000000000000000000..e5e205ac3006f1f282bfddc43315eab0f3b247ae --- /dev/null +++ b/scraped_kb_articles/recursive-references-in-avro-schema-are-not-allowed.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/recursive-references-in-avro-schema-are-not-allowed", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nApache Spark returns an error when trying to read from an\nApache Avro data source\nif the Avro schema has a recursive reference.\norg.apache.spark.sql.avro.IncompatibleSchemaException:\r\nFound recursive reference in Avro schema, which can not be processed by Spark\nCause\nSpark SQL does not support recursive references in an Avro data source.\nSolution\nAvoid using recursive references in your Avro schema.\nTest for recursive references\nYou can test your Avro schema for recursive references with\nSchemaConverters.toSqlType()\n.\n%scala\r\n\r\nimport org.apache.spark.sql.avro.SchemaConverters\r\nSchemaConverters.toSqlType()\nIf the Avro schema contains recursive references,\nSchemaConverters.toSqlType\nreturns an error.\nExample\nCreate an Avro schema with a recursive reference.\n%scala\r\n\r\nimport org.apache.avro.Schema\r\nval schema = new Schema.Parser().parse(\"\"\"{\r\n  
\"type\": \"record\",\r\n  \"name\": \"LongList\",\r\n  \"aliases\": [\"LinkedLongs\"],                     \r\n  \"fields\" : [\r\n    {\"name\": \"value\", \"type\": \"long\"},             \r\n    {\"name\": \"next\", \"type\": [\"null\", \"LongList\"]} \r\n  ]\r\n}\"\"\")\nTest the schema with\nSchemaConverters.toSqlType\n.\n%scala\r\n\r\nimport org.apache.spark.sql.avro.SchemaConverters\r\nSchemaConverters.toSqlType(schema)\nIt returns an\nIncompatibleSchemaException\nerror.\nIncompatibleSchemaException: Found recursive reference in Avro schema, which can not be processed by Spark: {  \"type\" : \"record\",  \"name\" : \"LongList\",  \"fields\" : [ {    \"name\" : \"value\",    \"type\" : \"long\"  }, {    \"name\" : \"next\",    \"type\" : [ \"null\", \"LongList\" ]  } ],  \"aliases\" : [ \"LinkedLongs\" ] }" +} \ No newline at end of file diff --git a/scraped_kb_articles/regular-expression-regex-not-filtering-as-expected-when-using-alnum-and-digit-in-the-sql-query.json b/scraped_kb_articles/regular-expression-regex-not-filtering-as-expected-when-using-alnum-and-digit-in-the-sql-query.json new file mode 100644 index 0000000000000000000000000000000000000000..4d42118e9b174ab3f9c84c0924f5c2d645c399e0 --- /dev/null +++ b/scraped_kb_articles/regular-expression-regex-not-filtering-as-expected-when-using-alnum-and-digit-in-the-sql-query.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/regular-expression-regex-not-filtering-as-expected-when-using-alnum-and-digit-in-the-sql-query", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you use\n[:alnum:]\nor\n[:digit:]\nin your regular expressions (regex) in an SQL query, you notice the regex does not filter the data correctly.\nFor example, in the following image, the query\nSELECT * from where REGEXP '[:alnum:]'\nreturns two rows, not the expected four.\nCause\nThis discrepancy lies in the difference between Photon and non-Photon compute. 
Photon uses the RE2 regex library, which interprets\n[:alnum:]\nas a character class meaning \"all alphanumeric characters.\"\nIn contrast, Apache Spark relies on the Java regex library, which simply treats\n[:alnum:]\nas a pattern matching the literal characters \":\", \"a\", \"l\", and so on.\nSolution\nUse\n\\p{Alnum}\nor\n\\p{Digit}\ninstead of POSIX-exclusive syntax.\nThe previous example query, rewritten as\nSELECT * from where REGEXP ‘\\\\p{Alnum}’\n, returns the expected four rows." +} \ No newline at end of file diff --git a/scraped_kb_articles/remount-storage-after-rotate-access-key.json b/scraped_kb_articles/remount-storage-after-rotate-access-key.json new file mode 100644 index 0000000000000000000000000000000000000000..f39919f4cdb4d8d9fa14a25500d094b2bc63e447 --- /dev/null +++ b/scraped_kb_articles/remount-storage-after-rotate-access-key.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/remount-storage-after-rotate-access-key", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have blob storage associated with a storage account mounted, but are unable to access it after access keys are rotated.\nCause\nThere are multiple mount points using the same storage account.\nRemounting some, but not all, of the mount points with new access keys results in access issues.\nSolution\nUse\ndbutils.fs.mounts()\nto check all mount points. Review the\ndbutils.fs.mounts() documentation\nfor usage details.\nUse\ndbutils.fs.unmount()\nto unmount all storage accounts. Review the\ndbutils.fs.unmount() documentation\nfor usage details.\nRestart the cluster.\nRemount the storage account with new keys. Review the\nAzure Data Lake Storage Gen2 and Blob Storage\ndocumentation for usage details." 
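The key point in the remount procedure above is that every mount backed by the rotated storage account must be unmounted and remounted together; remounting only some of them is what causes the access issues. As a minimal sketch of that selection step, assuming mount entries shaped like the ones `dbutils.fs.mounts()` returns (the account and mount names below are invented):

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.mounts()
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])

def mounts_for_account(mounts, storage_account):
    """Return every mount point backed by the given storage account,
    so all of them can be unmounted and remounted with the new key."""
    return [m.mountPoint for m in mounts if f"@{storage_account}." in m.source]

mounts = [
    MountInfo("/mnt/raw", "wasbs://raw@acct1.blob.core.windows.net"),
    MountInfo("/mnt/curated", "wasbs://curated@acct1.blob.core.windows.net"),
    MountInfo("/mnt/other", "abfss://misc@acct2.dfs.core.windows.net"),
]
print(mounts_for_account(mounts, "acct1"))  # ['/mnt/raw', '/mnt/curated']
```

In a notebook, each returned mount point would be passed to `dbutils.fs.unmount()` before remounting with the rotated key.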
+} \ No newline at end of file diff --git a/scraped_kb_articles/remove-log4j1x-jmsappender-socketserver-classes.json b/scraped_kb_articles/remove-log4j1x-jmsappender-socketserver-classes.json new file mode 100644 index 0000000000000000000000000000000000000000..2350b4c8411f7e17fa5ab89ee69061745f5d5dfa --- /dev/null +++ b/scraped_kb_articles/remove-log4j1x-jmsappender-socketserver-classes.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/remove-log4j1x-jmsappender-socketserver-classes", + "title": "Título do Artigo Desconhecido", + "content": "Databricks recently published a blog on\nLog4j 2 Vulnerability (CVE-2021-44228) Research and Assessment\n. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable.\nDatabricks also does not use the affected classes from Log4j 1.x with known vulnerabilities (\nCVE-2021-4104\n,\nCVE-2020-9488\n, and\nCVE-2019-17571\n). 
However, if your code uses one of these classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities.\nIf your code uses Log4j, you should upgrade to Log4j 2.17 or above.\nIf you cannot upgrade for technical reasons, you can use a global init script (\nAWS\n|\nAzure\n|\nGCP\n) to strip the affected classes from Log4j on cluster start.\nDelete\nWarning\nBecause we do not control the code you run, we cannot guarantee that this solution will prevent Log4j from loading the affected classes in all cases.\nConfigure the global init script\nDelete\nInfo\nRunning this script is a breaking change for any code that relies on the affected classes.\nAWS\nGo to the Admin Console and click the\nGlobal Init Scripts\ntab.\nClick the\n+ Add\nbutton.\nEnter the name of the script.\nCopy the following script into the\nScript\nfield.\n%sh\r\n\r\n#!/bin/bash\r\n\r\necho 'Init script to remove certain Log4J 1.x classes, version 1.0 (2021-12-17)'\r\n\r\nFILES_TO_DELETE=(\r\n  org/apache/log4j/net/JMSAppender.class\r\n  org/apache/log4j/net/SocketServer.class\r\n)\r\n\r\nfind \"/databricks\" \\\r\n    -name '*log4j*.jar' \\\r\n    -exec echo -e \"\\nProcessing {}\" \\; -exec zip -d {} \"${FILES_TO_DELETE[@]}\" \\;\r\n\r\nexit 0\nIf you have more than one global init script configured for your workspace, you should configure this script to run after your other scripts.\nEnsure the\nEnabled\nswitch is toggled on.\nClick\nAdd\n.\nRestart ALL running clusters.\nDelete\nAzure\nGo to the Admin Console and click the\nGlobal Init Scripts\ntab.\nClick the\n+ Add\nbutton.\nEnter the name of the script.\nCopy the following script into the\nScript\nfield.\n%sh\r\n\r\n#!/bin/bash\r\n\r\necho 'Init script to remove certain Log4J 1.x classes, version 1.0 (2021-12-17)'\r\n\r\nFILES_TO_DELETE=(\r\n  org/apache/log4j/net/JMSAppender.class\r\n  org/apache/log4j/net/SocketServer.class\r\n)\r\n\r\nfind \"/databricks\" \\\r\n    -name '*log4j*.jar' \\\r\n    -exec echo 
-e \"\\nProcessing {}\" \\; -exec zip -d {} \"${FILES_TO_DELETE[@]}\" \\;\r\n\r\nexit 0\nIf you have more than one global init script configured for your workspace, you should configure this script to run after your other scripts.\nEnsure the\nEnabled\nswitch is toggled on.\nClick\nAdd\n.\nRestart ALL running clusters.\nDelete\nGCP\nUse the\nGlobal Init Scripts API 2.0\nto apply the following init script to every cluster in your workspace.\n%sh\r\n\r\n#!/bin/bash\r\n\r\necho 'Init script to remove certain Log4J 1.x classes, version 1.0 (2021-12-17)'\r\n\r\nFILES_TO_DELETE=(\r\n  org/apache/log4j/net/JMSAppender.class\r\n  org/apache/log4j/net/SocketServer.class\r\n)\r\n\r\nfind \"/databricks\" \\\r\n    -name '*log4j*.jar' \\\r\n    -exec echo -e \"\\nProcessing {}\" \\; -exec zip -d {} \"${FILES_TO_DELETE[@]}\" \\;\r\n\r\nexit 0\nRestart ALL running clusters after applying the global init script.\nDelete\nVerify the affected classes are not available\nYou should run a test on each cluster to ensure the affected classes are not available.\nTest 1\nYou can run an assert check on the affected classes in a notebook.\n%scala\r\n\r\nassert(this.getClass.getClassLoader().getResource(\"org/apache/log4j/net/JMSAppender.class\") == null)\r\nassert(this.getClass.getClassLoader().getResource(\"org/apache/log4j/net/SocketServer.class\") == null)\nThis sample code runs successfully if you have disabled the affected classes.\nThis sample code should return an error if you have NOT disabled the affected classes.\nTest 2\nYou can attempt to import the affected classes into a notebook.\n%scala\r\n\r\nimport org.apache.log4j.net.JMSAppender\r\nimport org.apache.log4j.net.SocketServer\nThis sample code runs successfully if you have NOT disabled the affected classes.\nThis sample code should return an error if you have disabled the affected classes.\nCaveats\nThere are some corner cases where you can re-introduce the Log4j 1.x versions of JMSAppender or SocketServer.\nProblem\nIf you 
install a Maven library with a transitive dependency on Log4j 1.x, all of its classes are re-added to the classpath.\nSolution\nYou can work around this issue by adding Log4j to the\nExclusions\nfield when installing Maven libraries.\nProblem\nIf you configure an external Apache Hive metastore, Apache Spark uses Ivy to resolve and download the correct metastore client library, and all of its transitive dependencies, possibly including Log4j 1.x.\nTo speed up cluster launch, you can cache the downloaded jars on DBFS and use an init script to install from the cache. If you cache jars like this, it is possible that Log4j 1.x may be included.\nSolution\nYou can configure the init script for your external metastore to delete the affected classes." +} \ No newline at end of file diff --git a/scraped_kb_articles/replay-cluster-spark-events.json b/scraped_kb_articles/replay-cluster-spark-events.json new file mode 100644 index 0000000000000000000000000000000000000000..f65555f138856f3799fa8f4fbe216cb1c1d8aeaf --- /dev/null +++ b/scraped_kb_articles/replay-cluster-spark-events.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/replay-cluster-spark-events", + "title": "Título do Artigo Desconhecido", + "content": "The Spark UI is commonly used as a debugging tool for Spark jobs.\nIf the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events.\nDelete\nWarning\nCluster log delivery\nis not enabled by default. 
You must enable cluster log delivery before starting your cluster, otherwise there will be no logs to replay.\nFollow the documentation to configure\nCluster log delivery\non your cluster.\nThe location of the cluster logs depends on the\nCluster Log Path\nthat you set during cluster configuration.\nFor example, if the log path is\ndbfs:/cluster-logs\n, the log files for a specific cluster will be stored in\ndbfs:/cluster-logs/\nand the individual event logs will be stored in\ndbfs:/cluster-logs//eventlog///\n.\nDelete\nNote\nThis example uses DBFS for cluster logs, but that is not a requirement. You can store cluster logs in DBFS or S3 storage.\nConfirm cluster logs exist\nReview the cluster log path and verify that logs are being written for your chosen cluster. Log files are written every five minutes.\nLaunch a single node cluster\nLaunch a single node cluster. You will replay the logs on this cluster.\nSelect the instance type based on the size of the event logs that you want to replay.\nRun the Event Log Replay notebook\nAttach the Event Log Replay notebook to the single node cluster.\nEnter the path to your chosen cluster event logs in the event_log_path field in the notebook.\nRun the notebook.\nEvent Log Replay notebook\nOpen notebook in a new tab.\nPrevent items getting dropped from the UI\nIf you have a long-running cluster, it is possible for some jobs and/or stages to get dropped from the Spark UI.\nThis happens due to default UI limits that are intended to prevent the UI from using up too much memory and causing an out-of-memory error on the cluster.\nIf you are using a single node cluster to replay the event logs, you can increase the default UI limits and devote more memory to the Spark UI. 
This prevents items from getting dropped.\nYou can adjust these values during cluster creation by editing the\nSpark Config\n.\nThis example contains the default values for these properties.\nspark.ui.retainedJobs 1000\r\nspark.ui.retainedStages 1000\r\nspark.ui.retainedTasks 100000\r\nspark.sql.ui.retainedExecutions 1000" +} \ No newline at end of file diff --git a/scraped_kb_articles/resolve-invalid-cast-input-error-on-serverless-compute.json b/scraped_kb_articles/resolve-invalid-cast-input-error-on-serverless-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..7c20dbeb07eda3baef6fa21508c843f236f7f4b5 --- /dev/null +++ b/scraped_kb_articles/resolve-invalid-cast-input-error-on-serverless-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/resolve-invalid-cast-input-error-on-serverless-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are running jobs on serverless compute when you get an invalid cast input error message. When you run the same jobs on classic compute, you do not get an invalid cast input error.\nExample error message\n[CAST_INVALID_INPUT] The value '100.0' of the type \"STRING\" cannot be cast to \"BIGINT\" because it is malformed. Correct the value as per the syntax, or change its target type. Use 'try_cast' to tolerate malformed input and return NULL instead.\nCause\nThe default setting of\nspark.sql.ansi.enabled\nis set to\ntrue\non serverless compute, which enforces strict ANSI SQL compliance. When you start a classic cluster with Databricks Runtime,\nspark.sql.ansi.enabled\nis set to\nfalse\nby default. This difference in settings can lead to casting errors when running jobs on serverless compute.\nSolution\nSet\nspark.sql.ansi.enabled\nto\nfalse\nbefore running queries on serverless compute. This disables strict ANSI SQL compliance.\nRun\nSET spark.sql.ansi.enabled=false\nto disable strict ANSI SQL compliance.\nRun your query." 
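Note the error message above also points at `try_cast` as an alternative to disabling ANSI mode: `try_cast` returns NULL for malformed input instead of failing the query, so the query can keep running with ANSI compliance enabled. As a rough illustration of the two behaviors, with plain Python standing in for Spark's cast semantics ('100.0' is malformed for BIGINT in both):

```python
def ansi_cast_bigint(value: str) -> int:
    """Mimic ANSI-mode CAST(value AS BIGINT): raise on malformed input.
    int('100.0') raises ValueError, analogous to CAST_INVALID_INPUT."""
    return int(value)

def try_cast_bigint(value: str):
    """Mimic try_cast(value AS BIGINT): tolerate malformed input, return None."""
    try:
        return int(value)
    except ValueError:
        return None

print(try_cast_bigint("100"))    # 100
print(try_cast_bigint("100.0"))  # None -- malformed for BIGINT, no error raised
```

In the actual query, the equivalent change is rewriting `CAST(col AS BIGINT)` as `try_cast(col AS BIGINT)`, which avoids changing `spark.sql.ansi.enabled` for the whole session.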
+} \ No newline at end of file diff --git a/scraped_kb_articles/resolve-spark-directory-structure-conflicts.json b/scraped_kb_articles/resolve-spark-directory-structure-conflicts.json new file mode 100644 index 0000000000000000000000000000000000000000..d693b040e8b5cafa8d900504737974725d87db9b --- /dev/null +++ b/scraped_kb_articles/resolve-spark-directory-structure-conflicts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/resolve-spark-directory-structure-conflicts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour Databricks workflow fails due to an internal Apache Spark assertion error.\nExample code\nThis example code results in an assertion error.\n%python\r\n\r\n# Save the dataset to the table path (non-partitioned)\r\ndf.write.mode(\"overwrite\").parquet(\"dbfs:/FileStore/Jayant/tableDir\")\r\n\r\n# Save the dataset again to a subdirectory path using `partitionBy` (partitioned)\r\ndf.write.mode(\"overwrite\").partitionBy(\"year\", \"month\").parquet(\"dbfs:/FileStore/Jayant/tableDir/2025/04\")\r\n\r\n# Reading this table using Spark results in an assertion error\r\nspark.read.parquet(\"dbfs:/FileStore/Jayant/tableDir\").display()\nError message\njava.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:\r\ndbfs:/filestore/jayant/tabledir\r\ndbfs:/filestore/jayant/tabledir/2025/04\r\n\r\nIf provided paths are partition directories, please set \"basePath\" in the options of the data source to specify the root directory of the table. 
If there are multiple root directories, please load them separately and then union them.\r\nat scala.Predef$.assert(Predef.scala:223)\r\nat org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:316)\r\nat org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:155)\r\nat org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:205)\r\nat org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:110)\r\nat org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:58)\r\nat org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:205)\r\nat org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:494)\r\nat org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:394)\r\nat org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:350)\r\nat scala.Option.getOrElse(Option.scala:189)\r\nat org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:350)\r\nat org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:871)\nCause\nThis issue occurs when reading a Spark table which has been created with spark partitioning but with ambiguous sub-directories under a table directory.\nThe error is caused by a failure in Spark’s partition discovery logic while attempting to infer the schema of a directory structure with inconsistent layouts.\nWhen you read data using Spark, it attempts to automatically infer partitioning by parsing the input paths. This is handled internally by\norg.apache.spark.sql.execution.datasources.PartitioningUtils.parsePartitions\n. 
For more information, review the\nsource code\n.\nIn the reported error, Spark detects two distinct paths:\ndbfs:/filestore/jayant/tabledir\ndbfs:/filestore/jayant/tabledir/2025/04\nThese paths represent conflicting directory structures: one appears unpartitioned, and the other resembles a partitioned layout based on path depth. But since the folder names (\n2025\n,\n04\n) are not in Hive style\nkey=value\nformat, Spark cannot map them to valid partition column names.\nThis ambiguity leads Spark to fail an internal assertion.\nSolution\nWhen reading partitioned data without partition column names in the path, set\nbasePath\nto the common root to correctly infer partitioning.\nExample code\nThis example code uses\nbasePath\nso Spark can correctly read both partitions and return an output.\n%python\r\n\r\nspark.read.option(\"basePath\", \"dbfs:/FileStore/Jayant/tableDir\").parquet(\"dbfs:/FileStore/Jayant/tableDir/2025/04\").display()\nPreventive measures\nAvoid mixing non-partitioned files and partitioned subdirectories under the same path when working with inference. Ensure that your table and partition directories adhere to a consistent format without extraneous directories.\nIf you want to create a table with partitions, always use Spark partitioning with\npartitionBy()\nand keep the write path to be the root table directory." 
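The preventive measure above comes down to keeping every subdirectory under the table root in Hive-style key=value form, which is the layout `partitionBy()` writes and partition discovery can map to columns. A small sketch of checking a path against that convention (the paths reuse the article's example; the helper itself is hypothetical):

```python
def is_hive_partition_path(root: str, path: str) -> bool:
    """True if every directory segment under root follows the key=value
    convention that Spark's partition discovery can map to columns."""
    root = root.rstrip("/")
    if not path.startswith(root + "/"):
        return False
    segments = path[len(root) + 1:].split("/")
    return all("=" in seg for seg in segments if seg)

root = "dbfs:/FileStore/Jayant/tableDir"
print(is_hive_partition_path(root, root + "/2025/04"))             # False
print(is_hive_partition_path(root, root + "/year=2025/month=04"))  # True
```

A False result for a subdirectory like `/2025/04` corresponds to the ambiguous layout that forces the `basePath` workaround.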
+} \ No newline at end of file diff --git a/scraped_kb_articles/resource_does_not_exist-error-when-creating-a-managed-table.json b/scraped_kb_articles/resource_does_not_exist-error-when-creating-a-managed-table.json new file mode 100644 index 0000000000000000000000000000000000000000..3424a785f4daf22b3add9a706df188f621df0ee9 --- /dev/null +++ b/scraped_kb_articles/resource_does_not_exist-error-when-creating-a-managed-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/resource_does_not_exist-error-when-creating-a-managed-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to create a managed table in a location outside the default metastore (root storage) in your workspace when you get an error that says the resource does not exist.\n[RequestId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ErrorClass=RESOURCE_DOES_NOT_EXIST] Status code: -1 error code: null error message: Cannot resolve hostname: xxxxxx.dfs.core.windows.net\nThis is not specific to any particular configuration or setting but rather an unexpected error related to the storage account.\nCause\nAn incorrect storage account is being used as the root storage for your metastore. The storage account\nxxxxx.dfs.core.windows.net\nis not recognized and does not exist in your cloud environment. 
This causes an error when you try to create managed tables, as they are stored in a different location that you previously specified.\nThe storage root bucket is not shown in the\nCatalog Explorer\nand the external location added in the metastore is not the root storage for this metastore.\nSolution\nChange the default root storage for your metastore to a valid storage account.\nUse the\nupdate a metastore\nAPI to update the metastore.\nTo prevent similar issues in the future, ensure that the root storage account for your metastore is valid and accessible.\nUse the\nget a metastore summary API\nto review the current metastore details.\nAdditionally, you should consider explicitly changing the storage root location for catalogs and schemas to a desired location, either at the catalog level or at the schema level." +} \ No newline at end of file diff --git a/scraped_kb_articles/resource_limit_exceeded-error-when-querying-a-delta-sharing-table.json b/scraped_kb_articles/resource_limit_exceeded-error-when-querying-a-delta-sharing-table.json new file mode 100644 index 0000000000000000000000000000000000000000..28ed30c9b69c3cff0537b78738280efd0a930e49 --- /dev/null +++ b/scraped_kb_articles/resource_limit_exceeded-error-when-querying-a-delta-sharing-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/resource_limit_exceeded-error-when-querying-a-delta-sharing-table", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are querying a Delta table shared in Delta Sharing and you get a\nRESOURCE_LIMIT_EXCEEDED\nerror.\nThe error may reference a timeout.\nio.delta.sharing.spark.util.UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 400 Bad Request {\"errorCode\":\"RESOURCE_LIMIT_EXCEEDED\",\"message\":\"A timeout occurred when processing the table. 
If it continues to happen, please contact your data provider to share a smaller table instead.\"}.\nThe error may also reference table metadata that exceeds size limits.\n{ 'errorCode': 'RESOURCE_LIMIT_EXCEEDED',\r\n'message': 'The table metadata size exceeded limits'}\nCause\nDelta Sharing has limits on the metadata size of a shared table.\nYou are limited to 700k\nAddFiles\nactions in the DeltaLog. This is how many active files you can have in a shared Delta table.\nYou are limited to 100k\nRemoveFiles\nactions in the DeltaLog. This is the number of files that have been deleted. This includes files that have been removed by operations like\nOPTIMIZE\nand MERGE.\nSolution\nYou can run\nOPTIMIZE\n(\nAWS\n|\nAzure\n|\nGCP\n) on the shared Delta table to reduce the number of active files.\nOPTIMIZE\ntable_name\n[WHERE\npredicate]\n[ZORDER\nBY\n(col_name1\n[,\n...\n]\n)\n]\nAfter running\nOPTIMIZE\nto reduce the number of active files, you may hit the\nRemoveFiles\nlimit if the\nOPTIMIZE\ncommand removed more than 100k files.\nIf this is the case, you can temporarily lower the\ndelta.logRetentionDuration\nproperty. This lowers the length of time items stay in the DeltaLog. By setting it to a short retention time, for example 24 hours, the transaction log is quickly purged, which helps you stay under the 100k\nRemoveFiles\nlimit.\n%sql\r\n\r\nALTER TABLE \r\nSET TBLPROPERTIES ('delta.logRetentionDuration'='24 hrs')\nDelete\nWarning\nLowering the\ndelta.logRetentionDuration\nproperty also reduces your ability to time travel. You can only time travel if the metadata is contained in the DeltaLog.
If the log retention is set to 24 hours, you can only time travel back 24 hours.\nOnce the issue is resolved, you should revert the\ndelta.logRetentionDuration\nproperty back to 30 days, so you can continue to use the time travel feature.\n%sql\r\n\r\nALTER TABLE \r\nSET TBLPROPERTIES ('delta.logRetentionDuration'='30 days')\nTo prevent the issue from reoccurring, you should run\nOPTIMIZE\nperiodically. This helps keep the number of active files below the limit." +} \ No newline at end of file diff --git a/scraped_kb_articles/resources_exhausted-error-message-when-trying-to-perform-self-joins-with-spark-connect.json b/scraped_kb_articles/resources_exhausted-error-message-when-trying-to-perform-self-joins-with-spark-connect.json new file mode 100644 index 0000000000000000000000000000000000000000..d897385519cd32a81ce2625efa5605f604cff933 --- /dev/null +++ b/scraped_kb_articles/resources_exhausted-error-message-when-trying-to-perform-self-joins-with-spark-connect.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/resources_exhausted-error-message-when-trying-to-perform-self-joins-with-spark-connect", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to perform self-joins, you encounter an issue with the maximum message size for Spark Connect. You receive the following error message.\nSparkConnectGrpcException vs )”...\nCause\nYour data is locally stored as a\nlocal_relation\n. Data stored this way is duplicated in the planning phase of a query during self-joins. The query plan becomes too large from the duplication, reaching the maximum message size.\nWhen you then perform a union including a large number of self-joins, the command fails with the message\nRESOURCE_EXHAUSTED\n, indicating that the sent message is larger than the maximum allowed size.\nSolution\nIncrease the message size limit from the default of 64 MB by changing the value of the\nspark.sql.session.localRelationCacheThreshold\nconfiguration. 
You can start by trying twice the default, or 128MB. Experiment with increasing further if the issue persists.\nAlternatively, take advantage of temporary (temp) views. Using temp views in the intermediary steps caches the table and uses a\ncached_relation\ninstead of a\nlocal_relation\n. Cached relations do not have a maximum message size, allowing you to avoid a message size limit.\nExample\nThe following code demonstrates how to use temp views. Write\ndf_union\nas a temp view and then read it on every step of the loop to ensure the message being sent uses a cached relation.\n# dfs is a list of dataframes to aggregate with union commands. df_union stores the aggregation, and starts with the first dataframe in the list.\r\ndf_union = dfs[0]\r\n\r\n#loop through all other dataframes in the list performing unions\r\nfor df in dfs[1:]:\r\n\tdf_union = df_union.union(df)\r\n\t#create the temp view\r\n\tdf_union.createOrReplaceTempView(\"df_union\")\r\n\t#make it so df_union now have data coming from the temp view and can take advantage of the cached_relation\r\n\tdf_union = spark.sql(\"SELECT * FROM df_union\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/restricting-sensitive-data-in-the-workspace.json b/scraped_kb_articles/restricting-sensitive-data-in-the-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..8acfe07f280f2f98c1434509c9c7aa123238dde5 --- /dev/null +++ b/scraped_kb_articles/restricting-sensitive-data-in-the-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/restricting-sensitive-data-in-the-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to restrict workspace users from accessing specific data. For example, you have sensitive data that you do not want everyone to be able to access or modify.\nCause\nThe DBFS root is accessible to all users and does not support access control. 
You should not save sensitive data on DBFS.\nFor more information, review the\nRecommendations for working with DBFS root\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nUnity Catalog\nIf your workspace is using Unity Catalog, you should store your data in\nUnity Catalog volumes\n(\nAWS\n|\nAzure\n|\nGCP\n).\nYou can use SQL or the workspace UI to manage file permissions.\nWorkspace Files\nIf your workspace is not Unity Catalog enabled, you should store your data as\nworkspace files\n(\nAWS\n|\nAzure\n|\nGCP\n). Access to workspace files can be managed with\naccess control lists\n(\nAWS\n|\nAzure\n|\nGCP\n).\nFor more information, review the\nRecommendations for files in volumes and workspace files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/resumed-streaming-job-fails-after-pause-with-streamingqueryexception-error.json b/scraped_kb_articles/resumed-streaming-job-fails-after-pause-with-streamingqueryexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..385d2939892ad059fa51fd2b2a62b6e0477693b1 --- /dev/null +++ b/scraped_kb_articles/resumed-streaming-job-fails-after-pause-with-streamingqueryexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/resumed-streaming-job-fails-after-pause-with-streamingqueryexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou resume a paused streaming job and it fails with a streaming query exception error message.\norg.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = ABC, runId = DEF] terminated with exception: The stream from your Delta table was expecting process data from version X, but the earliest available version in the _delta_log directory is Y.\nCause\nA streaming query exception can occur when the stream is paused for a time period that exceeds the\ndelta.logRetentionDuration\nvalue.\nDelta streaming accesses files 
within the\n/offsets\nand\n/commits\ndirectories located in the checkpoint path. This determines the last processed delta version as well as the next version to be processed.\nConsider a scenario where you have a Delta table, denoted as T1, serving as the stream's source. From the offset data, it is determined that the last version to be processed was X. If the stream is halted and then reinitiated after a 30-day hiatus, it will fail because of the default\ndelta.logRetentionDuration\nbeing 30 days. Consequently, delta version logs older than this period would have been purged. Upon restarting, the stream attempts to resume from version X, which is no longer present in the\n_delta_log\ndirectory, leading to an error.\nSolution\nYou should avoid pausing streaming jobs for a long period of time. Ideally, you should never pause a job for longer than the\ndelta.logRetentionDuration\nvalue. However, in instances where the stream must be paused for durations surpassing the\ndelta.logRetentionDuration\n, you can restart the stream with a new checkpoint location.\nWhen you restart the stream with a new checkpoint location, it compels the stream to reprocess all source data, potentially resulting in data duplication at the target sink. 
To avoid this, make sure you implement data deduplication strategies such as data filtering or utilizing Delta merge operations while writing into the delta sink.\nIf you cannot restart the stream with a new checkpoint location, you can use the\nfailOnDataLoss\nor the\nignoreMissingFiles\noptions.\nUsing the\n.option(\"failOnDataLoss\", \"false\")\nwithin the\nreadStream\nallows the stream to continue without interruption, automatically identifying a new delta version for processing.\nUsing the\n.option(\"ignoreMissingFiles\", \"true\")\nwithin the\nreadStream\nor the Apache Spark configuration\nspark.conf.set(\"spark.sql.files.ignoreMissingFiles\", \"true\")\nallows Spark to ignore any files from sources that are missing and continue without interruption.\nNote\nThe\nfailOnDataLoss\nand the\nignoreMissingFiles\noptions should only be used as a last resort. They may lead to data loss, as a significant range of delta logs could be missing and thus disregarded by the stream. In such cases, establishing an alternative pipeline to reprocess the omitted data is advisable.\nPreventative Measures\nTo avert these scenarios in the future, it is recommended to adjust the\ndelta.logRetentionDuration\naccording to your specific requirements. If the stream needs to be paused for a period extending beyond 30 days, modify the\ndelta.logRetentionDuration\nto 45 days or 60 days.
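The failure condition described above can be stated simply: the pause outlasted `delta.logRetentionDuration`, so the delta version the checkpoint points at has been purged. A hypothetical helper for checking this before resuming a stream (a simplification, since the exact purge time also depends on when log cleanup runs):

```python
from datetime import datetime, timedelta

def can_resume_from_checkpoint(last_processed_at: datetime,
                               now: datetime,
                               log_retention: timedelta = timedelta(days=30)) -> bool:
    """True if the delta version recorded in the checkpoint should still be
    present in _delta_log under the given delta.logRetentionDuration."""
    return now - last_processed_at <= log_retention

paused = datetime(2024, 1, 1)
print(can_resume_from_checkpoint(paused, datetime(2024, 1, 20)))  # True
print(can_resume_from_checkpoint(paused, datetime(2024, 2, 15)))  # False -- logs likely purged
```

A False result is the signal to restart with a new checkpoint location (with deduplication at the sink) rather than resume in place.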
+} \ No newline at end of file diff --git a/scraped_kb_articles/retrieve-queries-disabled-user.json b/scraped_kb_articles/retrieve-queries-disabled-user.json new file mode 100644 index 0000000000000000000000000000000000000000..eedc599cb5dde1fd0c4ec3c33848b9ec3a5279f5 --- /dev/null +++ b/scraped_kb_articles/retrieve-queries-disabled-user.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/retrieve-queries-disabled-user", + "title": "Unknown Article Title", + "content": "When a Databricks SQL user is removed from an organization, the queries owned by the user remain, but they are only visible to those who already have permission to access them.\nA Databricks SQL admin can transfer ownership to other users, as well as delete alerts, dashboards, and queries owned by the disabled user account.\nClone a query\nA Databricks SQL admin has view access to all queries by default. As an admin, you can view and delete any query.\nYou cannot edit a query if it is not shared with you. This includes admin users.\nThe solution is to clone a query, and then edit the permissions.\nOpen Databricks SQL.\nClick\nQueries\n.\nClick\nAdmin View\n.\nSelect the query you want to clone.\nClick the vertical ellipsis and select\nClone\n.\nYou can now edit the copy of the original query as needed.\nDelete a query\nOpen Databricks SQL.\nClick\nQueries\n.\nClick\nAdmin View\n.\nSelect the query you want to delete.\nClick the vertical ellipsis and select\nMove to Trash\n.\nClick\nMove to Trash\nto confirm.\nWarning\nWhen you delete a query, all alerts and dashboard widgets created with its visualizations are also deleted."
+} \ No newline at end of file diff --git a/scraped_kb_articles/revoke-all-user-privileges.json b/scraped_kb_articles/revoke-all-user-privileges.json new file mode 100644 index 0000000000000000000000000000000000000000..71f3d48cc686fb91676a7d50183212e502cecedd --- /dev/null +++ b/scraped_kb_articles/revoke-all-user-privileges.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/revoke-all-user-privileges", + "title": "Unknown Article Title", + "content": "When user permissions are explicitly granted for individual tables and views, the selected user can access those tables and views even if they don’t have permission to access the underlying database.\nIf you want to revoke a user’s access, you can do so with the\nREVOKE\ncommand. However, the\nREVOKE\ncommand is explicit, and is strictly scoped to the object specified in the command.\nFor example:\n%sql\r\n\r\nREVOKE ALL PRIVILEGES ON DATABASE FROM `@`\r\nREVOKE SELECT ON FROM `@`\nIf you want to revoke all privileges for a single user, you can do it with a series of multiple commands, or you can use a regular expression and a series of\nfor\nloops to automate the process.\nExample code\nThis example code matches the\n\npattern to the database name and the table name and then revokes the user’s privileges.
The search is recursive.\n%python\r\n\r\nfrom re import search\r\ndatabaseQuery = sqlContext.sql(\"show databases\")\r\ndatabaseList = databaseQuery.collect()\r\n# This loop revokes at the database level.\r\nfor db in databaseList:\r\n  listTables = sqlContext.sql(\"show tables from \"+db['databaseName'])\r\n  tableRows = listTables.collect()\r\n  if search(, db['databaseName']):\r\n    revokeDatabase=sqlContext.sql(\"REVOKE ALL PRIVILEGES ON DATABASE \"+db['databaseName']+\" FROM ``\")\r\n    display(revokeDatabase)\r\n    print(\"Ran the REVOKE query on \"+db['databaseName']+\" for \")\r\n  # This loop revokes at the table level.\r\n  for table in tableRows:\r\n    if search(,table['tableName']):\r\n      revokeCommand=sqlContext.sql(\"REVOKE SELECT ON \"+table['database']+\".\"+table['tableName']+\" FROM ``\")\r\n      display(revokeCommand)\r\n      print(\"Revoked the SELECT permissions on \"+table['database']+\".\"+table['tableName']+\" for \")\nInfo\nThese commands only work if you have enabled table access control for the cluster (\nAWS\n|\nAzure\n|\nGCP\n).
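The loop above can be prototyped offline by generating the REVOKE statements before handing them to Spark. This is a minimal stdlib-only sketch: the catalog contents, the `^sales` pattern, and the user name are illustrative assumptions, not values from the article (which elides its placeholders).

```python
import re

# Hypothetical stand-ins for the results of "show databases" / "show tables";
# the names, pattern, and user below are illustrative assumptions.
catalog = {"sales_eu": ["orders"], "hr": ["sales_temp", "payroll"]}
pattern = r"^sales"
user = "`user@example.com`"

statements = []
for db, tables in catalog.items():
    # Database-level revoke when the database name matches the pattern.
    if re.search(pattern, db):
        statements.append(f"REVOKE ALL PRIVILEGES ON DATABASE {db} FROM {user}")
    # Table-level revoke when the table name matches the pattern.
    for table in tables:
        if re.search(pattern, table):
            statements.append(f"REVOKE SELECT ON {db}.{table} FROM {user}")

for stmt in statements:
    print(stmt)
```

Printing the generated statements first makes it easy to review exactly which objects will be revoked before running them through `sqlContext.sql()`.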
+} \ No newline at end of file diff --git a/scraped_kb_articles/rmarkdown-sparklyr-code.json b/scraped_kb_articles/rmarkdown-sparklyr-code.json new file mode 100644 index 0000000000000000000000000000000000000000..2dfab1e8e5a10048b8a28ec117c2178e9fe0255b --- /dev/null +++ b/scraped_kb_articles/rmarkdown-sparklyr-code.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/rmarkdown-sparklyr-code", + "title": "Unknown Article Title", + "content": "Problem\nAfter you install and configure RStudio in the Databricks environment, when you launch RStudio and click the\nKnit\nbutton to knit a Markdown file that contains code to initialize a\nsparklyr\ncontext, rendering fails with the following error:\nfailed to start sparklyr backend:object 'DATABRICKS_GUID' not found Calls: … tryCatch -> tryCatchList-> tryCatchOne -> Execution halted\nCause\nIf you try to initialize a\nsparklyr\ncontext in a Markdown notebook with code similar to the following, the Markdown page fails to render because the\nknitr\nprocess spawns a new namespace that is missing the\n'DATABRICKS_GUID'\nglobal variable.\n%r\r\n\r\nlibrary(sparklyr)\r\nsc <- spark_connect(method = \"databricks\")\nSolution\nInstead of rendering the Markdown page by clicking the\nKnit\nbutton in the R Markdown console, use the following script:\n%r\r\n\r\nrmarkdown::render(\"your_doc.Rmd\")\nRender the Markdown file using the R console, and then you can access the file in the RStudio\nFiles\ntab."
+} \ No newline at end of file diff --git a/scraped_kb_articles/rocksdb-fails-to-acquire-lock.json b/scraped_kb_articles/rocksdb-fails-to-acquire-lock.json new file mode 100644 index 0000000000000000000000000000000000000000..7003d348bc28faefe70aba99127bd57db03a41ff --- /dev/null +++ b/scraped_kb_articles/rocksdb-fails-to-acquire-lock.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/rocksdb-fails-to-acquire-lock", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to use\nRocksDB\nas a state store for your structured streaming application, when you get an error message saying that the instance could not be acquired.\nCaused by: java.lang.IllegalStateException: RocksDB instance could not be acquired by [ThreadId: 742, task: 140.3 in stage 3152, TID 553193] as it was not released by [ThreadId: 42, task: 140.1 in stage 3152, TID 553083] after 10009 ms\r\nStateStoreId(opId=0,partId=140,name=default)\nCause\nTwo concurrent tasks cannot modify the same\nRocksDBStateStore\ninstance.\nConcurrent tasks attempting to access the same state store (the state store tied to the same partition of state maintained by\nflatMapGroupsWithState\n) should be extremely rare. It can only happen if the task updating the store instance was restarted by the driver before the previous attempt had terminated.\nInfo\nAbrupt node termination, like when a spot instance terminates, can also cause this error.\nSolution\nThis error prevents the state from being corrupted. Restart the query if you encounter this error.\nIf zombie tasks are taking too long to clean up their resources, when the next task tries to acquire a lock, it will also fail. In this case, you should allow more time for the thread to clean up.\nSet the wait time for the thread by configuring\nrocksdb.lockAcquireTimeoutMs\nin your SQL configuration.
The value is in milliseconds.\n%scala\r\nspark.sql(\"set spark.sql.streaming.stateStore.rocksdb.lockAcquireTimeoutMs = 20000\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/round-function-not-returning-the-number-of-decimal-places-indicated-in-the-parameters.json b/scraped_kb_articles/round-function-not-returning-the-number-of-decimal-places-indicated-in-the-parameters.json new file mode 100644 index 0000000000000000000000000000000000000000..e4cfe68df838084edc311ea0303490a5a8001b46 --- /dev/null +++ b/scraped_kb_articles/round-function-not-returning-the-number-of-decimal-places-indicated-in-the-parameters.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/round-function-not-returning-the-number-of-decimal-places-indicated-in-the-parameters", + "title": "Unknown Article Title", + "content": "Problem\nWhen using the\nround()\nfunction in Databricks SQL with floating point numbers, you notice the output does not adhere to the parameters.\nExample query\nIn the following query, you indicate a parameter of three decimal places, but the output does not have three decimal places.\n```sql \r\nspark.sql(\"select round(float(0.9200000166893005), 3)\").display() \r\n```\nreturns\n0.9200000166893005\ninstead of\n0.920\n.\nYou notice that when using decimal numbers, the output aligns to the parameter of three decimal places.\n```sql \r\nspark.sql(\"select round(0.9200000166893005, 3)\").display() \r\n```\nreturns\n0.920\n.\nCause\nDecimal numbers are represented and processed with exact precision.
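The exact-decimal versus binary-float contrast behind this rounding article can be reproduced in plain Python, using the same value as the article's example (a minimal sketch; the `DECIMAL(10, 3)` analogy is an assumption mirroring the SQL fix):

```python
from decimal import Decimal, ROUND_HALF_UP

x = 0.9200000166893005  # the article's example value

# Rounding a binary float yields the nearest representable double,
# which prints as 0.92, not a guaranteed three decimal places.
r = round(x, 3)

# String formatting pins the number of displayed decimals.
formatted = "{:.3f}".format(x)

# Decimal arithmetic keeps exact decimal precision, analogous to
# CAST(... AS DECIMAL(10, 3)) in SQL.
d = Decimal(str(x)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(r, formatted, d)
```

The formatted string and the `Decimal` result both display as `0.920`, while the raw `round()` result remains a binary float.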
Floating point numbers are represented and processed in binary using the IEEE 754 standard.\nThe IEEE 754 standard cannot accurately represent many decimal fractions, leading to small precision errors when converting between binary and decimal representations.\nIn the example provided,\nfloat(0.9200000166893005)\nretains more precision than a standard floating point number, causing the\nround()\nfunction to behave unexpectedly.\nSolution\nUse string formatting in Python or cast the result to a\nDECIMAL\ntype in SQL.\nIn Python, you can use string formatting like\n\"{:.3f}\".format(number)\nto display the number as\n0.920\n, regardless of minor floating point inaccuracies.\nIn SQL, casting the result to a\nDECIMAL(p,s)\ntype ensures the number is displayed with the desired precision.\n```sql\r\nspark.sql(\"SELECT CAST(ROUND(float(0.9200000166893005), 3) AS DECIMAL(10, 3))\").display()\r\n```" +} \ No newline at end of file diff --git a/scraped_kb_articles/row-value-assignments-not-reflecting-expected-output-in-code-that-loops-through-temporary-views.json b/scraped_kb_articles/row-value-assignments-not-reflecting-expected-output-in-code-that-loops-through-temporary-views.json new file mode 100644 index 0000000000000000000000000000000000000000..2df1b393e59ca4fd8ae7e60baea8da8e8d4cadbe --- /dev/null +++ b/scraped_kb_articles/row-value-assignments-not-reflecting-expected-output-in-code-that-loops-through-temporary-views.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/row-value-assignments-not-reflecting-expected-output-in-code-that-loops-through-temporary-views", + "title": "Unknown Article Title", + "content": "Problem\nWhen using a cluster with Apache Spark Connect to run code that invokes temporary views in a loop, the row value assignments do not reflect the expected output.\nExample\nIn code with a\nRANGE\nof\n2\n, you expect row one to have the value 0, and row two to have the value 1.
Instead, both rows have the value 1.\nRANGE = 2\r\ndf_temp_view = None\r\nfor i in range(RANGE):\r\n    df = spark.sql(f\"select {i} as iterator\")\r\n    df.createOrReplaceTempView(\"temp_view\")\r\n    if df_temp_view is None:\r\n        df_temp_view = spark.sql(\"select * from temp_view\")\r\n    else:\r\n        df_temp_view = df_temp_view.union(spark.sql(\"select * from temp_view\"))\r\n\r\ndf_temp_view.display()\nCause\nTemporary views in Spark Connect are analyzed lazily. This means any changes to the temporary view are not validated until the view is called, including filters and transformations.\nBecause the temporary view is recreated on each iteration, at the moment the Spark action is called Spark analyzes the latest version of the view, producing two rows with the same value.\nSolution\nUse DataFrames directly instead of temporary views. For more information, refer to the\nTutorial: Load and transform data using Apache Spark DataFrames\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you prefer to continue using temporary views in this context, apply unique names to each temporary view."
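The lazy-analysis pitfall in this article has a close analogue in plain Python closures, which also bind a rebound name late rather than capturing its value per iteration. This stdlib-only sketch (no Spark involved) illustrates the same failure mode and the equivalent of the "bind eagerly" fix:

```python
# Each lambda is evaluated only when called -- like a temp view that is
# analyzed only when an action runs -- so every closure sees the final i.
lazy = [lambda: i for i in range(2)]
print([f() for f in lazy])  # both closures report 1

# Binding the value eagerly per iteration (a default argument) captures
# each iteration's value, analogous to building DataFrames directly.
eager = [lambda i=i: i for i in range(2)]
print([f() for f in eager])  # reports 0 and 1
```

The fix in both settings is the same idea: materialize the per-iteration value at the point of the loop instead of re-resolving a shared, rebound name at evaluation time.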
+} \ No newline at end of file diff --git a/scraped_kb_articles/rstudio-server-backend-error.json b/scraped_kb_articles/rstudio-server-backend-error.json new file mode 100644 index 0000000000000000000000000000000000000000..ea53dcc93c4ed3585d52d6a24c48add9fc8508cb --- /dev/null +++ b/scraped_kb_articles/rstudio-server-backend-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/rstudio-server-backend-error", + "title": "Unknown Article Title", + "content": "Problem\nYou get a backend connection error when using RStudio server.\nError in Sys.setenv(EXISTING_SPARKR_BACKEND_PORT = system(paste0(\"wget -qO - 'http://localhost:6061/?type=\\\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRBackend\\\"' --post-data='{\\\"@class\\\":\\\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRBackend\\\", \\\"guid\\\": \\\"\", :\r\nwrong length for argument\nIf you view the cluster driver and worker logs (\nAWS\n|\nAzure\n|\nGCP\n), you see a message about exceeding the maximum number of RBackends.\n21/08/09 15:02:26 INFO RDriverLocal: 312. RDriverLocal.3f6d80d6-70c4-4101-b50f-2530df112ea2: Exceeded maximum number of RBackends limit: 200\r\n21/08/09 15:03:55 INFO RDriverLocal: 313. RDriverLocal.3f6d80d6-70c4-4101-b50f-2530df112ea2: Exceeded maximum number of RBackends limit: 200\r\n21/08/09 15:04:06 INFO RDriverLocal: 314. RDriverLocal.3f6d80d6-70c4-4101-b50f-2530df112ea2: Exceeded maximum number of RBackends limit: 200\r\n21/08/09 15:13:42 INFO RDriverLocal: 315.
RDriverLocal.3f6d80d6-70c4-4101-b50f-2530df112ea2: Exceeded maximum number of RBackends limit: 200\nCause\nDatabricks clusters are configured for 200\nRBackends\nby default.\nIf you exceed this limit, you get an error.\nSolution\nYou can use an init script to increase the soft limit of\nRBackends\navailable for use.\nThis sample code creates an init script that sets a limit of 400\nRBackends\non the cluster.\n%scala\r\n\r\nval initScriptContent = s\"\"\"\r\n |#!/bin/bash\r\n |cat > /databricks/common/conf/rbackend_limit.conf << EOL\r\n |{\r\n | databricks.daemon.driver.maxNumRBackendsPerDriver = 400\r\n |}\r\n |EOL\r\n\"\"\".stripMargin\r\n\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//set_rbackend.sh\",initScriptContent, true)\nInfo\nThe sample code sets the\nRBackends\nlimit to 400. You can adjust this number as needed. You should not exceed 500\nRBackends\n.\nInstall the newly created init script as a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nYou will need the full path to the location of the script (\ndbfs:/databricks//set_rbackend.sh\n).\nRestart the cluster after you have installed the init script.\nValidate solution\nYou can confirm that the changes were successful by running this sample code in a notebook.\n%r\r\n\r\nlibrary(magrittr)\r\nSparkR:::callJStatic(\r\n  \"com.databricks.backend.daemon.driver.RDriverLocal\",\r\n  \"getDriver\",\r\n  get(DB_GUID_, envir = .GlobalEnv)) %>% SparkR:::callJMethod(\"conf\") %>% SparkR:::callJMethod(\"maxNumRBackendsPerDriver\")\nWhen run, this code returns the current\nRBackends\nlimit on the cluster.\nBest practices\nEnsure that you log out of RStudio when you are finished using it. This terminates the R session and cleans the\nRBackend\n.\nIf the RStudio server is killed, or the RSession terminates unexpectedly, the cleanup step may not happen.\nDatabricks Runtime 9.0 and above automatically cleans up idle RBackend sessions."
+} \ No newline at end of file diff --git a/scraped_kb_articles/run-a-custom-databricks-runtime-on-your-cluster.json b/scraped_kb_articles/run-a-custom-databricks-runtime-on-your-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..c2611500b25b50f78fbbbab485f546790b7bf9f0 --- /dev/null +++ b/scraped_kb_articles/run-a-custom-databricks-runtime-on-your-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/run-a-custom-databricks-runtime-on-your-cluster", + "title": "Unknown Article Title", + "content": "The majority of Databricks customers use production Databricks Runtime releases (\nAWS\n|\nAzure\n|\nGCP\n) for their clusters. However, there may be certain times when you are asked to run a custom Databricks Runtime after raising a support ticket.\nWarning\nCustom Databricks Runtime images are created for specific, short-term fixes and edge cases. If a custom image is appropriate, it will be provided by Databricks Support during case resolution.\nDatabricks Support cannot provide a custom image on demand. You should NOT open a ticket just to request a custom Databricks Runtime.\nThis article explains how to start a cluster using a custom Databricks Runtime image after you have been given the Runtime image name by support.\nInstructions\nUse the workspace UI\nFollow the steps for your specific browser to add the\nCustom Spark Version\nfield to the\nNew Cluster\nmenu.\nAfter you have enabled the\nCustom Spark Version\nfield, you can use it to start a new cluster using the custom Databricks runtime image you were given by support.\nInfo\nWhen following the steps in this article, you will see a warning in your browser's Javascript console that says:\nDo not copy-paste anything here.
This can be used to compromise your account.\nIt is OK to enter the commands listed in this article.\nChrome / Edge\nLog in to your Databricks workspace.\nClick\nCompute\n.\nClick\nAll-purpose clusters\n.\nClick\nCreate Cluster\n.\nPress Command+Option+J (Mac) or Control+Shift+J (Windows, Linux, ChromeOS) to open the Javascript console.\nEnter\nwindow.prefs.set(\"enableCustomSparkVersions\",true)\nin the Javascript console and run the command.\nReload the page.\nCustom Spark Version\nnow appears in the New Cluster menu.\nEnter the custom Databricks runtime image name that you got from Databricks support in the\nCustom Spark Version\nfield.\nContinue creating your cluster as normal.\nFirefox\nLog in to your Databricks workspace.\nClick\nCompute\n.\nClick\nAll-purpose clusters\n.\nClick\nCreate Cluster\n.\nPress Command+Option+K (Mac) or Control+Shift+K (Windows, Linux) to open the Javascript console.\nEnter\nwindow.prefs.set(\"enableCustomSparkVersions\",true)\nin the Javascript console and run the command.\nReload the page.\nCustom Spark Version\nnow appears in the New Cluster menu.\nEnter the custom Databricks runtime image name that you got from Databricks support in the\nCustom Spark Version\nfield.\nContinue creating your cluster as normal.\nSafari\nLog in to your Databricks workspace.\nClick\nCompute\n.\nClick\nAll-purpose clusters\n.\nClick\nCreate Cluster\n.\nPress Command+Option+C (Mac) to open the Javascript console.\nEnter\nwindow.prefs.set(\"enableCustomSparkVersions\",true)\nin the Javascript console and run the command.\nReload the page.\nCustom Spark Version\nnow appears in the New Cluster menu.\nEnter the custom Databricks runtime image name that you got from Databricks support in the\nCustom Spark Version\nfield.\nContinue creating your cluster as normal.\nUse the API\nYou need to set the custom image with the\nspark_version\nattribute when starting a cluster via the API.\nYou can use the API to create both interactive clusters and job clusters with a
custom Databricks runtime image.\n\"spark_version\": \"custom:\nExample code\nThis sample code shows the\nspark_version\nattribute used within the context of starting a cluster via the API.\n%sh\r\n\r\ncurl -H \"Authorization: Bearer \" -X POST  https:///api/2.0/clusters/create -d '{\r\n  \"cluster_name\": \"heap\",\r\n  \"spark_version\": \"custom:\",\r\n  \"node_type_id\": \"r3.xlarge\",\r\n  \"spark_conf\": {\r\n    \"spark.speculation\": true\r\n  },\r\n  \"aws_attributes\": {\r\n    \"availability\": \"SPOT\",\r\n    \"zone_id\": \"us-west-2a\"\r\n  },\r\n  \"num_workers\": 1,\r\n  \"spark_env_vars\": {\r\n    \"SPARK_DRIVER_MEMORY\": \"25g\"\r\n  }\r\n}'\nFor more information, please review the create Clusters API 2.0 (\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/running-a-python-udf-fails-with-permission-error.json b/scraped_kb_articles/running-a-python-udf-fails-with-permission-error.json new file mode 100644 index 0000000000000000000000000000000000000000..edf71b057277227e9e36df0d4206c3a3a9c6292a --- /dev/null +++ b/scraped_kb_articles/running-a-python-udf-fails-with-permission-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/running-a-python-udf-fails-with-permission-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to run a Python UDF without EXECUTE permission, the UDF fails.\n‘User does not have EXECUTE on Routine or Model '..'\nCause\nDatabricks strictly enforces\nEXECUTE\npermissions on Python UDFs. Only authorized users can run UDFs.\nPython UDFs are able to perform complex, resource-intensive operations, such as custom logic and external API calls, which go beyond the scope of SQL UDFs. Strong controls on this complexity allow your organization to better manage the increased potential for security and stability risks.
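The request body in the earlier curl example can also be built and serialized from Python before posting it to the Clusters API. This is a sketch of the payload construction only; the cluster name and the `<image-name-from-support>` placeholder are illustrative, and the workspace URL and token must be supplied by you.

```python
import json

# Placeholder values -- substitute the image name given to you by support.
payload = {
    "cluster_name": "custom-runtime-test",
    "spark_version": "custom:<image-name-from-support>",
    "node_type_id": "r3.xlarge",
    "num_workers": 1,
}

body = json.dumps(payload)
print(body)
# This serialized body is what `curl -d` sends to
# https://<workspace-url>/api/2.0/clusters/create with an
# "Authorization: Bearer <token>" header.
```

Building the payload as a dict and serializing it avoids the quoting mistakes that hand-written inline JSON in a shell command invites.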
Better management enhances system security and governance.\nSolution\nThe function owner needs to grant\nEXECUTE\npermission for a Python UDF to a user or user group.\nUser\nGRANT EXECUTE ON FUNCTION .. TO ;\nGroup\nGRANT EXECUTE ON FUNCTION .. TO " +} \ No newline at end of file diff --git a/scraped_kb_articles/running-c-plus-plus-code-scala.json b/scraped_kb_articles/running-c-plus-plus-code-scala.json new file mode 100644 index 0000000000000000000000000000000000000000..fa799825c9ffa56fd19a822eeadd6142926955a8 --- /dev/null +++ b/scraped_kb_articles/running-c-plus-plus-code-scala.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/running-c-plus-plus-code-scala", + "title": "Unknown Article Title", + "content": "Run C++ from Scala notebook\nReview the\nRun C++ from Scala notebook\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/running-c-plus-plus-code.json b/scraped_kb_articles/running-c-plus-plus-code.json new file mode 100644 index 0000000000000000000000000000000000000000..342c9d807699fcdf06426ddfcf84e711da4e5057 --- /dev/null +++ b/scraped_kb_articles/running-c-plus-plus-code.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/running-c-plus-plus-code", + "title": "Unknown Article Title", + "content": "Run C++ from Python example notebook\nReview the\nRun C++ from Python notebook\nto learn how to compile C++ code and run it on a cluster."
+} \ No newline at end of file diff --git a/scraped_kb_articles/running-general-queries-against-anonymous-or-temporary-functions-fails-with-insufficient_permissions-error.json b/scraped_kb_articles/running-general-queries-against-anonymous-or-temporary-functions-fails-with-insufficient_permissions-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1fd7659e3302859ce67f219fa13c045a4c372913 --- /dev/null +++ b/scraped_kb_articles/running-general-queries-against-anonymous-or-temporary-functions-fails-with-insufficient_permissions-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/running-general-queries-against-anonymous-or-temporary-functions-fails-with-insufficient_permissions-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen running general queries against anonymous or temporary functions, you may get an\n[INSUFFICIENT_PERMISSIONS] Insufficient privileges\nerror message.\nAn error occurred while calling o790.withColumn.\r\n: org.apache.spark.SparkSecurityException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:\r\nUser does not have permission SELECT on anonymous function. SQLSTATE: 42501\nCause\nANONYMOUS FUNCTION\ncontrols access to anonymous or temporary functions. If you do not have this privilege when querying those functions, the query fails with the underlying error.\nThe error can also occur if you’re using Databricks SQL.\nANONYMOUS FUNCTION\nobjects are not supported in Databricks SQL.\nSolution\nAs a workspace admin, run the following query to provide the affected entity with the required permission.\n%sql\r\n\r\nGRANT SELECT ON ANONYMOUS FUNCTION TO ``\nIn Databricks SQL environments, check if the Apache Spark setting\nspark.databricks.acl.sqlOnly true\nis present in your cluster setup.
If it is, delete it.\nThis allows the use of Python or Scala, which work with\nANONYMOUS FUNCTION\n.\nFor more information on using\nANONYMOUS FUNCTION\n, review the\nCREATE FUNCTION (External)\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/running-optimize-on-delta-tables-causing-concurrentdeletedeleteexception-error.json b/scraped_kb_articles/running-optimize-on-delta-tables-causing-concurrentdeletedeleteexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..402ca0d49fc962857e5b08558ff0aad7fbc15a01 --- /dev/null +++ b/scraped_kb_articles/running-optimize-on-delta-tables-causing-concurrentdeletedeleteexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/running-optimize-on-delta-tables-causing-concurrentdeletedeleteexception-error", + "title": "Unknown Article Title", + "content": "Problem\nWhen running\nOPTIMIZE\non a Delta table using a job, you receive an error.\n[DELTA_CONCURRENT_DELETE_DELETE] ConcurrentDeleteDeleteException: This transaction attempted to delete one or more files that were deleted (for example 2w/part-00004-c745c895-7e13-4f70-9e06-7243bc6c3174.c000.snappy.parquet) by a concurrent update. Please try the operation again.\nCause\nTwo or more jobs are attempting to perform optimization operations on the same table at the same time. There are two case types.\nAn\nAUTO OPTIMIZE\nis done, and a subsequent manual\nOPTIMIZE\nconflicts.\nYou perform a manual\nOPTIMIZE\n, and a subsequent\nAUTO OPTIMIZE\nconflicts.\nSolution\nFirst, check the conflicting commit message to verify which case is occurring.\nIf the conflicting commit message says\n\"auto\": true\n, this indicates an\nAUTO OPTIMIZE\njob clashed with a manual\nOPTIMIZE\n.\nIf the conflicting commit message says\n\"auto\": false\n, the manual operation clashed with an earlier auto-triggered one. Further confirm by running\nDESCRIBE HISTORY \n.
Look for consecutive\nOPTIMIZE\noperations — one with\nauto=true\nand the other\nauto=false\n— around the timestamp in the error.\nThe solution remains the same for both cases. If you need to run\nOPTIMIZE\nmanually, disable\nAUTO OPTIMIZE\n. Since auto compaction and optimized writes are always enabled for\nMERGE\n,\nUPDATE\n, and\nDELETE\noperations, override the functionality by adding the following two configurations.\nspark.databricks.delta.optimizeWrite.enabled false\r\nspark.databricks.delta.autoCompact.enabled false\nFor more information, review the\nConfigure Delta Lake to control data file size\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nImportant\nDatabricks generally recommends keeping\nAUTO OPTIMIZE\nenabled. Disabling it could lead to lower performance and higher costs." +} \ No newline at end of file diff --git a/scraped_kb_articles/runs-not-nested-sparktrials-hyperopt.json b/scraped_kb_articles/runs-not-nested-sparktrials-hyperopt.json new file mode 100644 index 0000000000000000000000000000000000000000..4a35e6b3d4b93c8dd64ab47eb3ea226bb9666af1 --- /dev/null +++ b/scraped_kb_articles/runs-not-nested-sparktrials-hyperopt.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/runs-not-nested-sparktrials-hyperopt", + "title": "Unknown Article Title", + "content": "Problem\nSparkTrials is an extension of\nHyperopt\n, which allows runs to be distributed to Spark workers.\nWhen you start an MLflow run with\nnested=True\nin the worker function, the results are supposed to be nested under the parent run.\nSometimes the results are\nnot\ncorrectly nested under the parent run, even though you ran SparkTrials with\nnested=True\nin the worker function.\nFor example:\n%python\r\n\r\nfrom hyperopt import fmin, tpe, hp, Trials, STATUS_OK\r\n\r\ndef train(params):\r\n  \"\"\"\r\n  An example train method that computes the square of the input.\r\n  This method will be passed to `hyperopt.fmin()`.\r\n\r\n  :param params:
hyperparameters. Its structure is consistent with how search space is defined. See below.\r\n  :return: dict with fields 'loss' (scalar loss) and 'status' (success/failure status of run)\r\n  \"\"\"\r\n  with mlflow.start_run(run_name='inner_run', nested=True) as run:\r\n\r\n    x, = params\r\n  return {'loss': x ** 2, 'status': STATUS_OK}\r\n\r\nwith mlflow.start_run(run_name='outer_run_with_sparktrials'):\r\n  spark_trials_run_id = mlflow.active_run().info.run_id\r\n  argmin = fmin(\r\n    fn=train,\r\n    space=search_space,\r\n    algo=algo,\r\n    max_evals=16,\r\n    trials=spark_trials\r\n  )\nExpected results:\nActual results:\nCause\nThe open source version of Hyperopt does not support the required features necessary to properly nest SparkTrials MLflow runs on Databricks.\nSolution\nDatabricks Runtime for Machine Learning includes an internal fork of Hyperopt with additional features. If you want to use SparkTrials, you should use Databricks Runtime for Machine Learning instead of installing Hyperopt manually from open-source repositories." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/runtimes-increase-when-using-loc-and-assignment-operations.json b/scraped_kb_articles/runtimes-increase-when-using-loc-and-assignment-operations.json new file mode 100644 index 0000000000000000000000000000000000000000..97cdae08a49056be0dd98b28b6c99bd388f50ba3 --- /dev/null +++ b/scraped_kb_articles/runtimes-increase-when-using-loc-and-assignment-operations.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/runtimes-increase-when-using-loc-and-assignment-operations", + "title": "Unknown Article Title", + "content": "Problem\nWhen using the\n.loc()\nand\nassignment(=)\noperations, you notice performance degradation such as increased runtimes.\nThe following code is an example of using\n.loc()\nand\nassignment(=)\nin the\ndef func(data1)\nsection.\nimport numpy as np;\r\nimport pandas as pd;\r\nfrom pyspark.sql.functions import lit;\r\n\r\nsql_query = f\"\"\"select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c\"\"\"\r\ndf1=spark.sql(sql_query);\r\ndf1 = df1.withColumn(\"e\", lit(None));\r\n\r\ndef func(data1):\r\n    i = 0\r\n    while i < (data1.shape[0]):\r\n      data1.loc[i, \"e\"] = data1.loc[i, \"d\"]\r\n      i = i + 1\r\n    return data1;\r\n\r\ndf2=df1.groupBy(\"a\", \"b\").applyInPandas(func, schema=\"a string, b long, c timestamp, d integer, e float\");\r\ndisplay(df2);\nCause\nThe use of\n.loc()\nand direct\nassignment(=)\noperations in pandas is generally discouraged because they can disable vectorized operations that NumPy performs under the hood.\nSolution\nUse vectorized operations instead.
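The contrast between element-wise `.loc()` assignment and a vectorized column copy can be shown with plain pandas, independent of Spark. This is a toy sketch: the frame, column names, and sizes are illustrative, not taken from the article.

```python
import pandas as pd

# A small frame standing in for one applyInPandas group.
df = pd.DataFrame({"d": [0, 1, 2, 3], "e": [None] * 4})

# Element-wise loop: one scalar .loc read and one scalar .loc write per row,
# defeating the bulk operations NumPy performs under the hood.
slow = df.copy()
i = 0
while i < slow.shape[0]:
    slow.loc[i, "e"] = slow.loc[i, "d"]
    i += 1

# Vectorized: a single column-level assignment executed in bulk.
fast = df.copy()
fast["e"] = fast["d"]

print(fast["e"].tolist())
```

Both approaches produce the same column values; the vectorized form simply does it in one bulk operation, which is where the runtime difference on large groups comes from.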
Vectorized operations are typically faster and more efficient, especially for large datasets.\nThe following code is the same example from the problem statement, with vectorized operations replacing\n.loc()\nand\nassignment(=)\nin\ndef func(data1)\n.\nimport numpy as np;\r\nimport pandas as pd;\r\nfrom pyspark.sql.functions import lit;\r\n\r\nsql_query = f\"\"\"select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c\"\"\"\r\ndf1=spark.sql(sql_query);\r\ndf1 = df1.withColumn(\"e\", lit(None));\r\n\r\ndef func(data1):\r\n  data1[\"e\"] = data1[\"d\"]\r\n  return data1;\r\n\r\ndf2=df1.groupBy(\"a\", \"b\").applyInPandas(func, schema=\"a string, b long, c timestamp, d integer, e float\");\r\ndisplay(df2);" +} \ No newline at end of file diff --git a/scraped_kb_articles/s3-path-data-size-for-a-delta-table-is-more-than-the-table-size-seen-from-the-describe-detail-output.json b/scraped_kb_articles/s3-path-data-size-for-a-delta-table-is-more-than-the-table-size-seen-from-the-describe-detail-output.json new file mode 100644 index 0000000000000000000000000000000000000000..d53d949030ccf31422e7d48d30750a814ba9c43a --- /dev/null +++ b/scraped_kb_articles/s3-path-data-size-for-a-delta-table-is-more-than-the-table-size-seen-from-the-describe-detail-output.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/s3-path-data-size-for-a-delta-table-is-more-than-the-table-size-seen-from-the-describe-detail-output", + "title": "Unknown Article Title", + "content": "Problem\nThe size of the Delta table does not match the size of the data in the storage bucket the table uses.\nCause\nYou may have stale files present in your storage path.\nTo verify, you can use\ndescribe detail \nto get your table size, and then compare to the actual size in the storage bucket.\nYou can also run\nVACUUM [RETAIN num HOURS] DRY RUN\nto check the details of the stale
files.\nSolution\nRun\nVACUUM\nto remove the stale files.\nFor more information, refer to the\nVACUUM\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation and the\nRemove unused data files with vacuum\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/sas-requires-current-abfs-client.json b/scraped_kb_articles/sas-requires-current-abfs-client.json new file mode 100644 index 0000000000000000000000000000000000000000..512425e501e492340bd01daf1ce20fba776e3d0e --- /dev/null +++ b/scraped_kb_articles/sas-requires-current-abfs-client.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/cloud/sas-requires-current-abfs-client", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile using SAS token authentication, you encounter an\nIllegalArgumentException\nerror.\nIllegalArgumentException: No enum constant shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AuthType.SAS\nCause\nSAS requires the current ABFS client. Previous ABFS clients do not support SAS.\nSolution\nYou must use the current ABFS client (\nshaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem\n) to use SAS.\nThis ABFS client is available by default in Databricks Runtime 7.3 LTS and above.\nIf you are using an old ABFS client, you should update your code so it references the current ABFS client." +} \ No newline at end of file diff --git a/scraped_kb_articles/save-plotly-to-dbfs.json b/scraped_kb_articles/save-plotly-to-dbfs.json new file mode 100644 index 0000000000000000000000000000000000000000..735831082b264a4d677882c8b6a8d49255cacae8 --- /dev/null +++ b/scraped_kb_articles/save-plotly-to-dbfs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/visualizations/save-plotly-to-dbfs", + "title": "Título do Artigo Desconhecido", + "content": "You can save a chart generated with Plotly to the driver node as a jpg or png file. 
Then, you can display it in a notebook by using the\ndisplayHTML()\nmethod. By default, you save Plotly charts to the\n/databricks/driver/\ndirectory on the driver node in your cluster. Use the following procedure to display the charts at a later time.\nGenerate a sample plot:\n%python\r\n\r\nfrom plotly.offline import plot\r\n\r\ndata = {'data': [{'y': [4, 2, 3, 4]}],\r\n            'layout': {'title': 'Test Plot',\r\n                       'font': dict(size=16)}}\r\np = plot(data,output_type='div')\r\ndisplayHTML(p)\nSave the generated plot to a file with\nplotly.io.write_image()\n:\n%python\r\n\r\nimport plotly.io\r\n\r\nplotly.io.write_image(fig=data,file=\"/databricks/driver/plotly_images/.jpg\", format=\"jpeg\",scale=None, width=None, height=None)\nCopy the file from the driver node and save it to DBFS:\n%python\r\n\r\ndbutils.fs.cp(\"file:/databricks/driver/plotly_images/.jpg\", \"dbfs:/FileStore//.jpg\")\nDisplay the image using\ndisplayHTML()\n:\n%python\r\n\r\ndisplayHTML('''/.jpg\">''')\nSee also Plotly in Python and R Notebooks." +} \ No newline at end of file diff --git a/scraped_kb_articles/scala-collection-immutable-hashmap-hashmap1-class-leading-to-oom-error-in-driver.json b/scraped_kb_articles/scala-collection-immutable-hashmap-hashmap1-class-leading-to-oom-error-in-driver.json new file mode 100644 index 0000000000000000000000000000000000000000..6aeebcf43bf39070ff067189aba830920bd6d241 --- /dev/null +++ b/scraped_kb_articles/scala-collection-immutable-hashmap-hashmap1-class-leading-to-oom-error-in-driver.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/scala-collection-immutable-hashmap-hashmap1-class-leading-to-oom-error-in-driver", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you use the\nscala.collection.immutable.HashMap$HashMap1\nclass, you notice performance issues and compute (formerly cluster) instability in your Databricks environment. 
You receive error messages such as\nDriver is up but is not responsive, likely due to GC\nor\nDriver is up but is not responsive, likely due to out of memory\n.\nYou confirm the OOM error is due to the heap allocation for the HashMap by taking the heap dump on the driver and checking whether it points to the\nscala.collection.immutable.HashMap$HashMap1\nclass.\nCause\nBy design, the driver memory accumulates Apache Spark UI events, including the\nscala.collection.immutable.HashMap$HashMap1\nclass, which has the most heap allocation. The accumulation leads to increased garbage collection (GC) activity.\nWhen the number of events exceeds the available memory, the driver struggles to manage the heap. This struggle leads to GC pauses and eventually causes the driver to become unresponsive or run out of memory.\nSolution\nChange the events location from driver memory to RocksDB.\nNavigate to the Databricks workspace and select the affected compute (formerly cluster).\nClick the\nEdit\nbutton to modify the compute configuration.\nExpand the\nAdvanced options\nsection and select the\nSpark\ntab.\nAdd the property\nspark.ui.store.path /databricks/driver/sparkuirocksdb\nin the\nSpark config\nfield.\nClick\nConfirm\nto save the changes.\nRestart the compute for the configuration change to take effect." +} \ No newline at end of file diff --git a/scraped_kb_articles/schema-from-case-class.json b/scraped_kb_articles/schema-from-case-class.json new file mode 100644 index 0000000000000000000000000000000000000000..e6d53a052d9104a804835ab4ec396dea9529dfb2 --- /dev/null +++ b/scraped_kb_articles/schema-from-case-class.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/schema-from-case-class", + "title": "Título do Artigo Desconhecido", + "content": "Spark provides an easy way to generate a schema from a Scala case class. 
For case class\nA\n, use the method\nScalaReflection.schemaFor[A].dataType.asInstanceOf[StructType]\n.\nFor example:\n%scala\r\n\r\nimport org.apache.spark.sql.types.StructType\r\nimport org.apache.spark.sql.catalyst.ScalaReflection\r\n\r\ncase class A(key: String, time: java.sql.Timestamp, date: java.sql.Date, decimal: java.math.BigDecimal, map: Map[String, Int], nested: Seq[Map[String, Seq[Int]]])\r\nval schema = ScalaReflection.schemaFor[A].dataType.asInstanceOf[StructType]\r\nschema.printTreeString" +} \ No newline at end of file diff --git a/scraped_kb_articles/schema-mismatch-issue-while-reading-parquet-files.json b/scraped_kb_articles/schema-mismatch-issue-while-reading-parquet-files.json new file mode 100644 index 0000000000000000000000000000000000000000..264585bb289f0b8890e2ed007b81de9ebb1963fe --- /dev/null +++ b/scraped_kb_articles/schema-mismatch-issue-while-reading-parquet-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/schema-mismatch-issue-while-reading-parquet-files", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to read data from a source directory containing multiple parquet files, you encounter an issue.\ns3:///test_file.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)\nCause\nThere is a schema mismatch between two parquet files in the same source directory.\nWhen Databricks attempts to read the files and unify their schemas, it encounters a type mismatch, which leads to the error.\nSolution\nFix the file schema. Identify the columns with schema discrepancies and modify them to have a consistent data type across all files.\nIf modifying the files is not an option, you can read the files separately and then union them. This approach allows you to handle schema differences.\nNote\nThis solution will not work for data type differences like timestamp and int. 
In that case you should correct the file or put the data in two separate tables." +} \ No newline at end of file diff --git a/scraped_kb_articles/security-bulletin-databricks-jdbc-driver-vulnerability-advisory-cve-2024-49194.json b/scraped_kb_articles/security-bulletin-databricks-jdbc-driver-vulnerability-advisory-cve-2024-49194.json new file mode 100644 index 0000000000000000000000000000000000000000..ddc92a771e1ac91500784cf7a6599c484d871513 --- /dev/null +++ b/scraped_kb_articles/security-bulletin-databricks-jdbc-driver-vulnerability-advisory-cve-2024-49194.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/security-bulletin-databricks-jdbc-driver-vulnerability-advisory-cve-2024-49194", + "title": "Título do Artigo Desconhecido", + "content": "Bulletin ID: DB-2024-01\nPublication Date: 2024-DEC-11\nLast Updated: 2024-DEC-11\nProblem\nA vulnerability in the Databricks JDBC Driver could potentially allow remote code execution (RCE) by triggering a JNDI injection via a JDBC URL parameter. This issue was reported via the Databricks bug bounty program and was assigned\nCVE-2024-49194\n. It is rated with a severity impact of high and is patched in Databricks JDBC Driver version 2.6.40 and above.\nCVE ID\nAffected Product Versions\nFixed Product Versions\nCVSSv3.1\nCVE-2024-49194\n2.6.38 and below\n2.6.40 and above\n7.3\nCause\nThe vulnerability is rooted in the improper handling of the\nkrbJAASFile\nparameter. An attacker could potentially exploit this vulnerability to gain RCE in the context of the driver by tricking the victim to use a specially crafted connection URL using the property\nkrbJAASFile\n.\nSolution\nAll current versions of Databricks Runtime on Databricks compute and serverless compute have already been patched and/or mitigated. 
Databricks recommends that you restart any long running clusters to ensure you are using the latest version of your selected runtime.\nIf you are running an impacted version of the JDBC driver on your local machine, you can mitigate the vulnerability by updating the driver. If you cannot update your JDBC driver, you should update your JVM configuration.\nUpdate JDBC driver\nThe\nDatabricks JDBC Driver version 2.6.40 and above\nfully resolves the issue.\nDatabricks recommends you download and install the updated driver immediately.\nUpdate JVM configuration\nIf you cannot update your JDBC Driver you can update two values in your JVM configuration to prevent arbitrary deserialization, via JNDI, which mitigates this vulnerability.\nEnsure the following configuration values are set to false:\ncom.sun.jndi.ldap.object.trustURLCodebase\ncom.sun.jndi.ldap.object.trustSerialData\nContact Information\nIf you have any questions, email Databricks support at\nhelp@databricks.com\nor the Databricks Security Team at\nsecurity@databricks.com\nwith the subject line\nCVE-2024-49194\n.\nFor vulnerability reporting, please visit\nhttps://hackerone.com/databricks\n.\nAcknowledgments\nWe would like to thank Ziyang Li, Ji'an Zhou, Ying Zhu of Alibaba Cloud Intelligence Security Team for their collaboration in identifying and addressing this issue.\nChangelog\n2024-DEC-11: Initial release of the security bulletin." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/seeing-slow-running-jobs-while-adaptive-parallelism-enabled.json b/scraped_kb_articles/seeing-slow-running-jobs-while-adaptive-parallelism-enabled.json new file mode 100644 index 0000000000000000000000000000000000000000..474a90553f5019c72d77ca505ce54d81b3271fb8 --- /dev/null +++ b/scraped_kb_articles/seeing-slow-running-jobs-while-adaptive-parallelism-enabled.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/seeing-slow-running-jobs-while-adaptive-parallelism-enabled", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou notice when your number of tasks decreases, parallelism decreases, and the job runs more slowly. Conversely, when the number of tasks increases, parallelism increases, and the job runs faster.\nCause\nYou have adaptive parallelism enabled.\nAdaptive parallelism allows fewer tasks to be planned based on the number of concurrent queries. When the dynamic changes happen multiple times on one job (for example, one query runs much longer than another), adaptive parallelism may not perform as optimally as expected.\nSolution\nDisable adaptive parallelism.\nNavigate to the cluster in question.\nIn the cluster configuration page, click the\nEdit\nbutton.\nScroll down to\nAdvanced\nand click to expand.\nClick the\nSpark\ntab, and in the\nSpark config\nfield add\nspark.databricks.execution.adaptiveParallelism.enabled false\nClick the\nSave\nbutton at the bottom of the page to apply the change to the cluster settings." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/seeing-unexpected-system-generated-queries-in-the-query-history.json b/scraped_kb_articles/seeing-unexpected-system-generated-queries-in-the-query-history.json new file mode 100644 index 0000000000000000000000000000000000000000..7c1889732b01211e74475107666239a00063c1b2 --- /dev/null +++ b/scraped_kb_articles/seeing-unexpected-system-generated-queries-in-the-query-history.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/seeing-unexpected-system-generated-queries-in-the-query-history", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou see the following system-generated query in your query history page and want to understand where it comes from.\n-- This is a system generated query from sql editor\r\ndescribe table extended `hive_metastore`.``.``\nCause\nThe following two scenarios are the most common triggers for system generated queries.\nDBSQL Intellisense and Catalog Explorer automatically trigger when working with tables in Hive metastore (HMS). The queries are part of how Databricks provides features like autocomplete, error highlighting, and schema previews.\nWhen your workspace default catalog is set to\nhive_metastore\n, these interactions are more frequent.\nThe current catalog/schema is set at the query editor level (either USQLE, legacy SQLE or through a USE statement in the query) which can differ from the default catalog set in the workspace.\nSolution\nSystem generated queries are part of normal Databricks behavior and are safe. But if they cause concern (such as query log noise or rarely, slowness), they can be optionally disabled or minimized through settings, or by switching to Unity Catalog.\nDisable or minimize through settings\nTo disable Autocomplete on the SQL editor:\nClick the kebab menu in the upper right side of the SQL editor pane.\nClick\nDisable Autocomplete\n.\nThen navigate to your workspace settings. 
From the settings landing page:\nGo to\nUser > Developer > Code Editor\n.\nToggle off\nSQL Syntax Error Highlighting\nto disable error highlighting.\nToggle off\nAutocomplete as you type\nto disable autocomplete.\nSwitch to Unity Catalog\nSwitching your default catalog to Unity Catalog (UC) reduces system-generated metadata queries while still allowing full access to HMS tables using a full path, such as\nhive_metastore.db.table\n.\nNavigate to your workspace settings.\nNavigate to\nWorkspace Admin > Advanced > Other\n.\nIn the\nDefault catalog\ntext field, type in the workspace level default catalog you want to use.\nClick the\nSave\nbutton.\nThen verify your query editor level catalog and schema matches the workspace.\nNavigate to\nSQL editor\nin the sidebar.\nClick the\n.\nname to the right of the\nRun\nbutton to expand the list of options.\nMake sure what you have selected matches the catalog and schema you just changed to in the\nWorkspace admin\nsettings." +} \ No newline at end of file diff --git a/scraped_kb_articles/select-on-view-not-showing-any-data-in-the-table-after-unsetting-timezone.json b/scraped_kb_articles/select-on-view-not-showing-any-data-in-the-table-after-unsetting-timezone.json new file mode 100644 index 0000000000000000000000000000000000000000..6c41cf4d3608c74d65cb4d6ac82bbb17730d449f --- /dev/null +++ b/scraped_kb_articles/select-on-view-not-showing-any-data-in-the-table-after-unsetting-timezone.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/select-on-view-not-showing-any-data-in-the-table-after-unsetting-timezone", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to change your data timezone from the default UTC to a specific region. You first unset a view’s table property\nview.sqlConfig.spark.sql.session.timeZone\nusing the following code.\nALTER VIEW .. 
UNSET TBLPROPERTIES ('view.sqlConfig.spark.sql.session.timeZone')\nYou then try to use the\nSELECT\noperation on the\nVIEW\nusing the following code to finish setting the timezone.\nSELECT * FROM ..\nYou see that no data appears from the table.\nCause\nWhen you unset the timezone table property, the data can no longer be read and so cannot be returned.\nSolution\nUse the following Apache Spark configuration, which uses your current session's SQL configs for view resolution, rather than the configs captured at the time the view was created or altered (such as UTC).\nspark.conf.set(\"spark.sql.legacy.useCurrentConfigsForView\", \"true\")\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/serverless-query-still-running-after-canceling.json b/scraped_kb_articles/serverless-query-still-running-after-canceling.json new file mode 100644 index 0000000000000000000000000000000000000000..c664d8ef7dbcdac9d3e07cd88af3939e6ee44d0e --- /dev/null +++ b/scraped_kb_articles/serverless-query-still-running-after-canceling.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/serverless-query-still-running-after-canceling", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA serverless query executed in a notebook still appears as\n\"running\"\nin the Databricks UI even after you attempt to cancel it.\nCause\nThe issue arises due to a combination of factors.\nThe connection between the REPL and the Spark Connect Gateway is lost, causing the interrupt process to remain incomplete.\nNo explicit interrupt request was sent to cancel the query, preventing proper termination.\nThere are gaps in the cleanup process, specifically the failure to call\npostClosed\nduring interruptions. 
This is essential to update the query history UI.\nThe SQL History service could not reconnect to the Spark Connect session to retrieve status updates after the session became inactive.\nSolution\nRun the\nCANCEL \ncommand from a different notebook or SQL editor to terminate the query forcefully.\nConfigure a timeout for queries by setting the\nspark.databricks.queryWatchdog.timeoutInSeconds\nproperty. This limits how long a query can run before being automatically terminated.\nVerify that the query is no longer displayed as\n\"running\"\nin the query history.\nFor more information, refer to the\nQuery history\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/serving-an-automl-model-failing-when-deployed-to-an-endpoint-with-failed-to-deploy-modelname-served-entity-creation-aborted-error.json b/scraped_kb_articles/serving-an-automl-model-failing-when-deployed-to-an-endpoint-with-failed-to-deploy-modelname-served-entity-creation-aborted-error.json new file mode 100644 index 0000000000000000000000000000000000000000..571b165af78a155053ec51d151ba6a3dfd65f7e6 --- /dev/null +++ b/scraped_kb_articles/serving-an-automl-model-failing-when-deployed-to-an-endpoint-with-failed-to-deploy-modelname-served-entity-creation-aborted-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/serving-an-automl-model-failing-when-deployed-to-an-endpoint-with-failed-to-deploy-modelname-served-entity-creation-aborted-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen serving a model trained using Databricks AutoML, you notice the model loads and runs inference correctly in a notebook, but fails when deployed to an endpoint with the following error.\nFailed to deploy modelName: served entity creation aborted because the endpoint update timed out. 
Please see service logs for more information.\nWhen you check the service logs, you see an additional error.\nValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected XX from C header, got from PyObject\nCause\nThe model environment includes a dependency such as pandas versions below 2.2.2, which is not compatible with NumPy 2.0.0 or above.\nHowever, MLflow does not automatically log a constraint like NumPy versions below 2.0.0, leading to an environment where pandas versions below 2.2.2 and NumPy version 2.0.0 or above can coexist.\nThis version mismatch causes binary incompatibility during model serving, resulting in the observed error.\nSolution\nEnsure the model environment includes an explicit version pin for NumPy, such as\nnumpy==\n, where\n\nis any version compatible with the pandas version mentioned in the dependency files.\nManually add this version constraint to the model’s\nconda.yaml\nand\nrequirements.txt\nfiles, then re-upload the updated files as artifacts to the same MLflow run.\n1. Identify the run ID for the model you want to serve.\n2. Use the following script to add a NumPy pin. 
The following script:\nDownloads the model’s existing\nconda.yaml\nand\nrequirements.txt\n.\nChecks if a NumPy version is pinned.\nIf not, it adds\nnumpy==\nbased on the local environment.\nUploads the updated files back as artifacts to the same run.\nimport mlflow\r\nimport os\r\nimport shutil\r\nimport tempfile\r\nimport yaml\r\nimport numpy as np  # Import numpy to access its version\r\nfrom mlflow.tracking import MlflowClient\r\n\r\nclient = mlflow.tracking.MlflowClient()\r\n\r\n# Create a temporary directory to work with artifacts\r\ntmp_dir = tempfile.mkdtemp()\r\n\r\nrun_id = \"\"  # Replace with your run id\r\n\r\ntry:\r\n    # Download and process conda.yaml\r\n    conda_artifact_path = f\"runs:/{run_id}/model/conda.yaml\"\r\n    conda_file_path = mlflow.artifacts.download_artifacts(artifact_uri=conda_artifact_path, dst_path=tmp_dir)\r\n    \r\n    with open(conda_file_path, 'r') as file:\r\n        conda_config = yaml.safe_load(file)\r\n    \r\n    # Check if numpy is listed under pip dependencies\r\n    pip_dependencies = conda_config.get(\"dependencies\", [])\r\n    pip_section = next((dep for dep in pip_dependencies if isinstance(dep, dict) and \"pip\" in dep), None)\r\n    \r\n    numpy_in_conda = False\r\n    if pip_section:\r\n        numpy_in_conda = any(pkg.startswith(\"numpy==\") for pkg in pip_section[\"pip\"])\r\n    \r\n    if not numpy_in_conda:\r\n        numpy_version = np.__version__\r\n        print(f\"Adding numpy=={numpy_version} to conda.yaml\")\r\n        \r\n        if not pip_section:\r\n            # If there's no pip section, create one\r\n            pip_section = {\"pip\": []}\r\n            conda_config[\"dependencies\"].append(pip_section)\r\n        \r\n        pip_section[\"pip\"].append(f\"numpy=={numpy_version}\")\r\n        \r\n        # Write the updated conda.yaml back to the file\r\n        with open(conda_file_path, 'w') as file:\r\n            yaml.dump(conda_config, file)\r\n        \r\n        # Log the updated 
conda.yaml back to MLflow\r\n        client.log_artifact(run_id=run_id, local_path=conda_file_path, artifact_path=\"model\")\r\n    \r\n    # Download and process requirements.txt\r\n    req_artifact_path = f\"runs:/{run_id}/model/requirements.txt\"\r\n    req_file_path = mlflow.artifacts.download_artifacts(artifact_uri=req_artifact_path, dst_path=tmp_dir)\r\n    \r\n    with open(req_file_path, 'r') as file:\r\n        requirements = [line.strip() for line in file.readlines()]\r\n    \r\n    numpy_in_requirements = any(pkg.startswith(\"numpy==\") for pkg in requirements)\r\n    \r\n    if not numpy_in_requirements:\r\n        numpy_version = np.__version__\r\n        print(f\"Adding numpy=={numpy_version} to requirements.txt\")\r\n        requirements.append(f\"numpy=={numpy_version}\")\r\n        \r\n        # Write the updated requirements.txt back to the file\r\n        with open(req_file_path, 'w') as file:\r\n            file.write(\"\\n\".join(requirements))\r\n        \r\n        # Log the updated requirements.txt back to MLflow\r\n        client.log_artifact(run_id=run_id, local_path=req_file_path, artifact_path=\"model\")\r\n        \r\nfinally:\r\n    # Clean up the temporary directory\r\n    shutil.rmtree(tmp_dir)\n3. After updating the artifacts, redeploy the endpoint to ensure consistent environments and prevent binary incompatibility errors.\nFor more information on supported formats for\nmlflow.artifacts.download_artifacts\n, refer to the MLflow\nmlflow.artifacts\nAPI documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/set-core-site-xml.json b/scraped_kb_articles/set-core-site-xml.json new file mode 100644 index 0000000000000000000000000000000000000000..2486a108ab28746d88bc4a09893b55913d56ffe8 --- /dev/null +++ b/scraped_kb_articles/set-core-site-xml.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/set-core-site-xml", + "title": "Título do Artigo Desconhecido", + "content": "You have a scenario that requires Apache Hadoop properties to be set.\nYou would normally do this in the\ncore-site.xml\nfile.\nIn this article, we explain how you can set\ncore-site.xml\nin a cluster.\nCreate the\ncore-site.xml\nfile in DBFS\nYou need to create a\ncore-site.xml\nfile and save it to DBFS on your cluster.\nAn easy way to create this file is via a bash script in a notebook.\nThis example code creates a\nhadoop-configs\nfolder on your cluster and then writes a single property\ncore-site.xml\nfile to that folder.\n%sh\r\n\r\nmkdir -p /dbfs/hadoop-configs/\r\ncat << 'EOF' > /dbfs/hadoop-configs/core-site.xml\r\n \r\n    \r\n    \r\n \r\nEOF\nYou can add multiple properties to the file by adding additional name/value pairs to the script.\nYou can also create this file locally, and then upload it to your cluster.\nCreate an init script that loads\ncore-site.xml\nThis example code creates an init script called\nset-core-site-configs.sh\nthat uses the\ncore-site.xml\nfile you just created.\nIf you manually uploaded a\ncore-site.xml\nfile and stored it elsewhere, you should update the\nconfig_xml\nvalue in the example code.\n%python\r\n\r\ndbutils.fs.put(\"/databricks/scripts/set-core-site-configs.sh\", \"\"\"\r\n#!/bin/bash\r\n   \r\necho \"Setting core-site.xml configs at `date`\"\r\n \r\nSTART_DRIVER_SCRIPT=/databricks/spark/scripts/start_driver.sh\r\nSTART_WORKER_SCRIPT=/databricks/spark/scripts/start_spark_slave.sh\r\n 
\r\nTMP_DRIVER_SCRIPT=/tmp/start_driver_temp.sh\r\nTMP_WORKER_SCRIPT=/tmp/start_spark_slave_temp.sh\r\n \r\nTMP_SCRIPT=/tmp/set_core-site_configs.sh\r\n \r\nconfig_xml=\"/dbfs/hadoop-configs/core-site.xml\"\r\n\r\ncat >\"$TMP_SCRIPT\" </{\r\n    r $config_xml\r\n    a \\\r\n    d\r\n}' /databricks/spark/dbconf/hadoop/core-site.xml\r\n \r\nEOL\r\ncat \"$TMP_SCRIPT\" > \"$TMP_DRIVER_SCRIPT\"\r\ncat \"$TMP_SCRIPT\" > \"$TMP_WORKER_SCRIPT\"\r\n \r\ncat \"$START_DRIVER_SCRIPT\" >> \"$TMP_DRIVER_SCRIPT\"\r\nmv \"$TMP_DRIVER_SCRIPT\" \"$START_DRIVER_SCRIPT\"\r\n \r\ncat \"$START_WORKER_SCRIPT\" >> \"$TMP_WORKER_SCRIPT\"\r\nmv \"$TMP_WORKER_SCRIPT\" \"$START_WORKER_SCRIPT\"\r\n \r\necho \"Completed core-site.xml config changes `date`\" \r\n \r\n\"\"\", True)\nAttach the init script to your cluster\nYou need to configure the newly created init script as a\ncluster-scoped init script\n.\nIf you used the example code, your\nDestination\nis\nDBFS\nand the\nInit Script Path\nis\ndbfs:/databricks/scripts/set-core-site-configs.sh\n.\nIf you customized the example code, ensure that you enter the correct path and name of the init script when you attach it to the cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/set-executor-log-level.json b/scraped_kb_articles/set-executor-log-level.json new file mode 100644 index 0000000000000000000000000000000000000000..4444f9377652dd58ec2910b7ff0862cd7256d5cd --- /dev/null +++ b/scraped_kb_articles/set-executor-log-level.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/set-executor-log-level", + "title": "Título do Artigo Desconhecido", + "content": "Delete\nWarning\nThis article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (\nCVE-2021-4104\n,\nCVE-2020-9488\n, and\nCVE-2019-17571\n). 
If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilities.\nTo set the log level on all executors, you must set it inside the JVM on each worker.\nFor example:\n%scala\r\n\r\nsc.parallelize(Seq(\"\")).foreachPartition(x => {\r\n  import org.apache.log4j.{LogManager, Level}\r\n  import org.apache.commons.logging.LogFactory\r\n\r\n  LogManager.getRootLogger().setLevel(Level.DEBUG)\r\n  val log = LogFactory.getLog(\"EXECUTOR-LOG:\")\r\n  log.debug(\"START EXECUTOR DEBUG LOG LEVEL\")\r\n})\nTo verify that the level is set, navigate to the\nSpark UI\n, select the\nExecutors\ntab, and open the\nstderr\nlog for any executor:" +} \ No newline at end of file diff --git a/scraped_kb_articles/set-nullability-when-using-saveastable-with-delta-tables.json b/scraped_kb_articles/set-nullability-when-using-saveastable-with-delta-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..80897562c81d7d9eaae862f3b948f16fd822922d --- /dev/null +++ b/scraped_kb_articles/set-nullability-when-using-saveastable-with-delta-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/set-nullability-when-using-saveastable-with-delta-tables", + "title": "Título do Artigo Desconhecido", + "content": "When creating a Delta table with\nsaveAsTable\n, the nullability of columns defaults to\ntrue\n(columns can contain null values). This is expected behavior.\nIn some cases, you may want to create a Delta table with the nullability of columns set to\nfalse\n(columns cannot contain null values).\nInstructions\nUse the\nCREATE TABLE\ncommand to create the table and define the columns that cannot contain null values by using\nNOT NULL\n.\nFor example, this sample code creates a Delta table with two integer columns. 
The column named\nnum\ncan contain null values, but the column named\nnum1\ncannot contain null values because it was created with\nNOT NULL\n.\n%sql \r\n\r\nCREATE TABLE (\r\n num Int,\r\n num1 Int NOT NULL\r\n )\r\nUSING DELTA\nNow that we have the Delta table defined we can create a sample DataFrame and use\nsaveAsTable\nto write to the Delta table.\nThis sample code generates sample data and configures the schema with the\nisNullable\nproperty set to\ntrue\nfor the field\nnum\nand\nfalse\nfor field\nnum1\n. This sample data is stored in a newly created DataFrame.\nFor the final step,\nsaveAsTable\nis used to write the data to the table we previously created.\nimport org.apache.spark.sql.Row\r\nimport org.apache.spark.sql.types._\r\nval data = Seq(\r\n Row(1, 3),\r\n Row(5, 7)\r\n)\r\n\r\nval schema = StructType(\r\n List(\r\n StructField(\"num\", IntegerType, true),\r\n StructField(\"num1\", IntegerType, false)\r\n )\r\n)\r\n\r\nval df = spark.createDataFrame(\r\n spark.sparkContext.parallelize(data),\r\n schema\r\n)\r\n\r\n\r\ndf.write.mode(\"overwrite\").format(\"delta\").saveAsTable(\"\")\nIf you read the table schema,\nnum\nallows for null values while\nnum1\ndoes not allow null values.\nroot\r\n |-- num: integer (nullable = true)\r\n |-- num1: integer (nullable = false)\nWarning\nIf you do not configure the nullability of your columns by creating a table in advance and instead try to write data to an undefined table, the nullability of all columns defaults to\ntrue\n. The DataFrame schema is ignored in this case.\nFor example, if you skip table creation and just try to write the data to a table with saveAsTable, and then read the schema, all columns are defined as being nullable." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/set-up-embedded-metastore.json b/scraped_kb_articles/set-up-embedded-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..4cf9e55a37618ce994adc79739589eda40f30485 --- /dev/null +++ b/scraped_kb_articles/set-up-embedded-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/set-up-embedded-metastore", + "title": "Título do Artigo Desconhecido", + "content": "You can set up a Databricks cluster to use an embedded metastore. You can use an embedded metastore when you only need to retain table metadata during the life of the cluster. If the cluster is restarted, the metadata is lost.\nIf you need to persist the table metadata or other data after a cluster restart, then you should use the default metastore or set up an external metastore.\nThis example uses the Apache Derby embedded metastore, which is an in-memory lightweight database. Follow the instructions in the notebook to install the metastore.\nYou should always perform this procedure on a test cluster before applying it to other clusters.\nSet up an embedded Hive metastore notebook\nReview the\nembedded Hive metastore notebook\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/shap-figure-not-appearing-in-the-artifacts-after-running-the-mflowevaluate-call-despite-setting-log_model_explainability-to-true.json b/scraped_kb_articles/shap-figure-not-appearing-in-the-artifacts-after-running-the-mflowevaluate-call-despite-setting-log_model_explainability-to-true.json new file mode 100644 index 0000000000000000000000000000000000000000..68b2f0e18a076f0420d449002960ba1b52f56bc9 --- /dev/null +++ b/scraped_kb_articles/shap-figure-not-appearing-in-the-artifacts-after-running-the-mflowevaluate-call-despite-setting-log_model_explainability-to-true.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/shap-figure-not-appearing-in-the-artifacts-after-running-the-mflowevaluate-call-despite-setting-log_model_explainability-to-true", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen executing the AutoML-generated code in a notebook, the SHAP figure does not appear in the artifacts after running the\nmlflow.evaluate()\ncall under the\nDefine an objective function\nUI section, despite setting\nlog_model_explainability = True\n.\nYou receive the following error message.\n WARNING mlflow.models.evaluation.default_evaluator: Skip logging model explainability insights because the shap explainer None requires all feature values to be numeric, and each feature column must only contain scalar values.\nCause\nThe SHAP explainer requires all feature values to be numeric, and each feature column must only contain scalar values. If any categorical or non-scalar features are present, the SHAP figure generation is skipped.\nSolution\nEnsure that all features are in the expected format and re-run the call.\nConvert categorical features to numerical values using encoding techniques. 
Two common examples are One-Hot encoding or Ordinal encoding.\nEnsure that each feature column contains only scalar values.\nAfter preparing your data, run the\nmlflow.evaluate()\ncall again with\nlog_model_explainability = True\n.\nVerify the SHAP figure is now generated in the artifacts." +} \ No newline at end of file diff --git a/scraped_kb_articles/shared-table-not-accessible-in-delta-sharing-using-python.json b/scraped_kb_articles/shared-table-not-accessible-in-delta-sharing-using-python.json new file mode 100644 index 0000000000000000000000000000000000000000..ecbd5cfb80fcaf92305274c55061aac33defd3a0 --- /dev/null +++ b/scraped_kb_articles/shared-table-not-accessible-in-delta-sharing-using-python.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/shared-table-not-accessible-in-delta-sharing-using-python", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou receive a\nFileNotFoundError\npointing to an Azure storage URL when trying to access a shared table in Delta Sharing using Python.\nExample\nimport delta_sharing\r\nprofile_file = \"\"\r\nclient = delta_sharing.SharingClient(profile_file)\r\n# List all shared tables.\r\nclient.list_all_tables()\r\n# Create a url to access a shared table.\r\ntable_url = profile_file + \"#..\"\r\n# Load a table as a Pandas DataFrame.\r\ndelta_sharing.load_as_pandas(table_url)\nError message:\nFileNotFoundError: https://.dfs.core.windows.net/
/part-00000.snappy.parquet?\nCause\nIf the Azure Blob Storage has network restrictions, it may prevent the Databricks service from accessing the required data.\nAdditionally, the storage principal or managed identity used may not have the required access permissions to the storage account. The STORAGE BLOB DATA CONTRIBUTOR role may not be assigned at the storage account level, or there may be a need to set the STORAGE BLOB DELEGATOR role at the storage account level and provide the STORAGE BLOB DATA CONTRIBUTOR role at the container level.\nSolution\nEnsure that the storage principal or managed identity used has the required access to the storage account. Assign the STORAGE BLOB DATA CONTRIBUTOR role at the storage account level.\nIf it is not possible to provide the STORAGE BLOB DATA CONTRIBUTOR role at the storage account level, set the STORAGE BLOB DELEGATOR role at the storage account level and provide the STORAGE BLOB DATA CONTRIBUTOR role at the container level.\nVerify that there are no network restrictions on the Azure Blob Storage that could prevent the Databricks service from accessing the required data.\nTest the access to the shared table again after making the necessary changes to the access permissions and network settings.\nFor more information on how to grant the managed identity access to the storage account, please refer to the\nUse Azure managed identities in Unity Catalog to access storage\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/shell-command-ls-does-not-work-on-dbfs-files-or-directories-when-using-a-shared-cluster.json b/scraped_kb_articles/shell-command-ls-does-not-work-on-dbfs-files-or-directories-when-using-a-shared-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..0fe72ef5c7fac9c973c4d9e4d627aa861bb21f3e --- /dev/null +++ b/scraped_kb_articles/shell-command-ls-does-not-work-on-dbfs-files-or-directories-when-using-a-shared-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/shell-command-ls-does-not-work-on-dbfs-files-or-directories-when-using-a-shared-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou try to perform the\n%sh ls\nshell command on a Databricks File System (DBFS) file or directory in your Unity Catalog-enabled workspace using a shared cluster. You grant\nSELECT\npermission on any file. Despite the permission, you receive the following error message.\nls: cannot access '/dbfs/{path}/': No such file or directory\nCause\nShared access mode has stricter access controls and limitations that affect file system operations. Specifically, shared access mode does not support FUSE for DBFS root and mounts, which means that direct file system operations using shell commands like\n%sh ls\ndo not work.\nFor details, review the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nThere are three options available.\nUse single access mode. Single access mode allows you to perform file system operations using shell commands like\n%sh ls\n.\nUse\ndbutils.fs.ls\ninstead of\n%sh ls\n. This command is designed to work with DBFS in Databricks and allows you to list the contents of a DBFS file in shared access mode.\nMigrate from DBFS mount points to volumes. Volumes allow you to perform file system operations using shell commands like\n%sh ls\nin shared access mode clusters. 
For more information, refer to the\nVolumes\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/shorten-cluster-provisioning-time-by-using-docker-containers.json b/scraped_kb_articles/shorten-cluster-provisioning-time-by-using-docker-containers.json new file mode 100644 index 0000000000000000000000000000000000000000..d2766e04239c0f8f6b50df4d5023a9b947522fed --- /dev/null +++ b/scraped_kb_articles/shorten-cluster-provisioning-time-by-using-docker-containers.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/shorten-cluster-provisioning-time-by-using-docker-containers", + "title": "Título do Artigo Desconhecido", + "content": "Problem:\nYou are running model predictions on scheduled jobs. However, every time you make a small change to your model package, it requires re-installation on the cluster, which slows down the provisioning process.\nSolution:\nTo shorten cluster provisioning time, you can leverage Docker container services.\nCreate a golden container environment with your required libraries pre-installed.\nUse the Docker container as the base for your cluster.\nModify the container to install any additional libraries specific to your project.\nProvision the cluster using the modified container.\nBy using Docker containers, you eliminate the need for each node to install a separate copy of the libraries, resulting in faster cluster provisioning.\nFor more information, refer to the Databricks documentation on custom containers (\nAWS\n|\nAzure\n).\nAdditionally, you can explore the\nDatabricks GitHub repository for containers\n, which provides base container examples you can customize." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/show-databases-unexpected-name.json b/scraped_kb_articles/show-databases-unexpected-name.json new file mode 100644 index 0000000000000000000000000000000000000000..397beeca88741cd4fa6e64d75795851017dd164a --- /dev/null +++ b/scraped_kb_articles/show-databases-unexpected-name.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/show-databases-unexpected-name", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using the\nSHOW DATABASES\ncommand and it returns an unexpected column name.\nCause\nThe column name returned by the\nSHOW DATABASES\ncommand changed in Databricks Runtime 7.0.\nDatabricks Runtime 6.4 Extended Support and below:\nSHOW DATABASES\nreturns\ndatabaseName\nas the column name.\nDatabricks Runtime 7.0 and above:\nSHOW DATABASES\nreturns\nnamespace\nas the column name.\nSolution\nYou can enable legacy column naming by setting the property\nspark.sql.legacy.keepCommandOutputSchema\nto\ntrue\nin the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n)." +} \ No newline at end of file diff --git a/scraped_kb_articles/single-scheduled-job-tries-to-run-multiple-times.json b/scraped_kb_articles/single-scheduled-job-tries-to-run-multiple-times.json new file mode 100644 index 0000000000000000000000000000000000000000..f280fcd44ed973846e17670262286f8bb180b8a3 --- /dev/null +++ b/scraped_kb_articles/single-scheduled-job-tries-to-run-multiple-times.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/single-scheduled-job-tries-to-run-multiple-times", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou\nschedule a job\n(\nAWS\n|\nAzure\n|\nGCP\n) to run once per day, using Quartz Cron Syntax, but the job tries to run multiple times on the same day.\nCause\nWhen the job was configured, it was scheduled by manually entering the cron syntax and a special character\n*\nwas accidentally set for the seconds value. 
This tells the cron scheduler to run the job once every second.\nQuartz cron syntax specifies a time in the format\nseconds minutes hours day-of-month month day-of-week\n. Numbers are used for the values and special characters can be used for multiple values.\nFor example, the cron syntax\n* 07 04 * * ?\ninstructs the system to attempt to start the job once every second from 04:07:00 to 04:07:59, every day.\nSolution\nYou need to specify a value for the seconds field. By default, Databricks uses 10 for the seconds field.\nBy changing\n*\nto\n10\nin the previous example, the cron scheduler only runs the job once per day, at 04:07:10.\nFor more information, review the Quartz Job Scheduler\nCronTrigger Tutorial\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/skew-hints-in-join.json b/scraped_kb_articles/skew-hints-in-join.json new file mode 100644 index 0000000000000000000000000000000000000000..2dde396533097c20188dc4c83f6ceb3066533e51 --- /dev/null +++ b/scraped_kb_articles/skew-hints-in-join.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/skew-hints-in-join", + "title": "Título do Artigo Desconhecido", + "content": "When you perform a\njoin\ncommand with\nDataFrame\nor\nDataset\nobjects, if you find that the query is stuck on finishing a small number of tasks due to\ndata skew,\nyou can specify the skew hint with the\nhint(\"skew\")\nmethod:\ndf.hint(\"skew\")\n. The skew join optimization (\nAWS\n|\nAzure\n|\nGCP\n) is performed on the\nDataFrame\nfor which you specify the\nskew\nhint.\nIn addition to the basic hint, you can specify the\nhint\nmethod with the following combinations of parameters: column name, list of column names, and column name and skew value.\nDataFrame\nand column name. The skew join optimization is performed on the specified column of the\nDataFrame\n.\n%python\r\n\r\ndf.hint(\"skew\", \"col1\")\nDataFrame\nand multiple columns. 
The skew join optimization is performed for multiple columns in the\nDataFrame\n.\n%python\r\n\r\ndf.hint(\"skew\", [\"col1\",\"col2\"])\nDataFrame\n, column name, and skew value. The skew join optimization is performed on the data in the column with the skew value.\n%python\r\n\r\ndf.hint(\"skew\", \"col1\", \"value\")\nExample\nThis example shows how to specify the skew hint for multiple\nDataFrame\nobjects involved in a\njoin\noperation:\n%scala\r\n\r\nval joinResults = ds1.hint(\"skew\").as(\"L\").join(ds2.hint(\"skew\").as(\"R\"), $\"L.col1\" === $\"R.col1\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/slack-alarm-notifications-test-fails-with-invalid_token.json b/scraped_kb_articles/slack-alarm-notifications-test-fails-with-invalid_token.json new file mode 100644 index 0000000000000000000000000000000000000000..eaf1bd2ed5be18afc5945940922cf5a8d35a6799 --- /dev/null +++ b/scraped_kb_articles/slack-alarm-notifications-test-fails-with-invalid_token.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/slack-alarm-notifications-test-fails-with-invalid_token", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/slow-autoscaling-external-metastore.json b/scraped_kb_articles/slow-autoscaling-external-metastore.json new file mode 100644 index 0000000000000000000000000000000000000000..a73c77481d8f5a5e79b60839b7cd26d374ea5cfa --- /dev/null +++ b/scraped_kb_articles/slow-autoscaling-external-metastore.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metastore/slow-autoscaling-external-metastore", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have an external metastore configured on your cluster and autoscaling is enabled, but the cluster is not autoscaling effectively.\nCause\nYou are copying the metastore jars to every executor, when they are only needed in the driver.\nIt takes time to initialize and run the 
jars every time a new executor spins up. As a result, adding more executors takes longer than it should.\nSolution\nYou should configure your cluster so the metastore jars are only copied to the driver.\nOption 1\n: Use an init script to copy the metastore jars.\nCreate a cluster with\nspark.sql.hive.metastore.jars\nset to\nmaven\nand\nspark.sql.hive.metastore.version\nto match the version of your metastore.\nStart the cluster and search the driver logs for a line that includes\nDownloaded metastore jars to\n.\n17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to \n\nis the location of the downloaded jars in the driver node of the cluster.\nCopy the jars to a DBFS location.\n%sh\r\n\r\ncp -r /dbfs/ExternalMetaStore_jar_location\nCreate the init script.\n%python\r\n\r\ndbutils.fs.put(\"dbfs:/databricks//external-metastore-jars-to-driver.sh\",\r\n\"\"\"\r\n#!/bin/bash\r\nif [[ $DB_IS_DRIVER = \"TRUE\" ]]; then\r\nmkdir -p /databricks/metastorejars/\r\ncp -r /dbfs/ExternalMetaStore_jar_location/* /databricks/metastorejars/\r\nfi\"\"\", True)\nInstall the init script that you just created as a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nYou will need the full path to the location of the script (\ndbfs:/databricks//external-metastore-jars-to-driver.sh\n).\nRestart the cluster.\nOption 2\n: Use the Apache Spark configuration settings to copy the metastore jars to the driver.\nEnter the following settings into your\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n):\nspark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://:/\r\nspark.hadoop.javax.jdo.option.ConnectionDriverName \r\nspark.hadoop.javax.jdo.option.ConnectionUserName \r\nspark.hadoop.javax.jdo.option.ConnectionPassword \r\nspark.sql.hive.metastore.version \r\nspark.sql.hive.metastore.jars /dbfs/metastore/jars/*\nThe source path can be external mounted storage or DBFS.\nThe metastore configuration can be applied globally within the workspace by using cluster policies 
(\nAWS\n|\nAzure\n|\nGCP\n).\nOption 3\n: Build a custom Databricks container with preloaded jars on\nAWS\nor\nAzure\n.\nReview the documentation on customizing containers with Databricks Container Services." +} \ No newline at end of file diff --git a/scraped_kb_articles/slow-model-fitting-when-implementing-alternating-least-squares-using-apache-spark-pyspark.json b/scraped_kb_articles/slow-model-fitting-when-implementing-alternating-least-squares-using-apache-spark-pyspark.json new file mode 100644 index 0000000000000000000000000000000000000000..5653f451579f147932a2cc723e44c50690fd1e09 --- /dev/null +++ b/scraped_kb_articles/slow-model-fitting-when-implementing-alternating-least-squares-using-apache-spark-pyspark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/slow-model-fitting-when-implementing-alternating-least-squares-using-apache-spark-pyspark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen implementing the\nAlternating Least Squares (ALS) algorithm using Apache Spark PySpark\n, you notice slow model fitting (even when using a significant number of computational resources), particularly in large datasets.\nCause\nSlow model fitting signals concurrent issues with compute-intensive operations associated with ALS and its default configurations. 
Compute-intensive operations involve high-grade matrix computations, which are CPU-intensive.\nRegarding default configuration, PySpark sets Spark partitions to ten, meaning only ten CPU cores are used even if compute has more available, leading to inefficient resource usage.\nSolution\nChange the default number of blocks to the total number of cores available in the compute.\nals = ALS()\r\nnum_cores = sc.defaultParallelism #Gets the total number of CPU cores on the cluster\r\nals.setNumBlocks(num_cores)#Overrides the number of blocks to the number of CPU cores\r\nmodel = als.fit()\nDatabricks also recommends using compute-optimized instances in order to obtain more cores.\nFurther reading: Empirical experiment\nThe following are the results of an experiment that consists of four trials to compare memory-optimized and compute-optimized computes. All trials use a consistent data size of 3.3 GB and incur approximately 6 DBUs per hour.\nTrials 1 and 2 use memory-optimized compute; trial 1 has a default setting of 10 cores, resulting in a processing time of 5 minutes. In trial 2, the ALS blocks are explicitly set to match the number of cores available (20), reducing the time to 3 minutes.\nTrials 3 and 4 use compute-optimized compute; trial 3 has a default setting of 10 cores and a processing time of 3 minutes. 
Hence, compared with Trial 1, which uses memory-optimized compute, the compute-optimized compute clearly performs better.\nIn Trial 4, the ALS blocks are explicitly set to match the number of cores available (40), which yields the lowest execution time, making it the optimal configuration for maximizing performance.\nTrial | Data size | DBU usage | Compute | ALS blocks | Time\nTrial 1 | 3.3GB | 6.12 DBU/hr | Memory Optimized, r61d.XL, 5 workers | Default (10) | 5 min\nTrial 2 | 3.3GB | 6.12 DBU/hr | Memory Optimized, r61d.XL, 5 workers | 20 cores | 3 min\nTrial 3 | 3.3GB | 6 DBU/hr | Compute Optimized, c4.2XL, 5 workers | Default (10) | 3 min\nTrial 4 (optimal config) | 3.3GB | 6 DBU/hr | Compute Optimized, c4.2XL, 5 workers | 40 cores | 2 min" +} \ No newline at end of file diff --git a/scraped_kb_articles/slowdown-from-root-disk-fill.json b/scraped_kb_articles/slowdown-from-root-disk-fill.json new file mode 100644 index 0000000000000000000000000000000000000000..6e4d136dc8a4a997bdefd0af5e9a85d1e994abe1 --- /dev/null +++ b/scraped_kb_articles/slowdown-from-root-disk-fill.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/slowdown-from-root-disk-fill", + "title": "Título do Artigo Desconhecido", + "content": "Note\nThis article applies to Databricks Runtime 7.3 LTS and below.\nProblem\nClusters start slowing down and may show a combination of the following symptoms:\nUnhealthy cluster events are reported:\nRequest timed out. Driver is temporarily unavailable.\nMetastore is down.\nDBFS is down.\nYou do not see any high GC events or memory utilization associated with the driver process.\nWhen you use top on the driver node you see an intermittent high average load.\nThe Ganglia-related gmetad process shows intermittent high CPU utilization.\nThe root disk shows high disk usage with\ndf -h /\n. 
Specifically,\n/var/lib/ganglia/rrds\nshows high disk usage.\nThe Ganglia UI is unable to show the load distribution.\nYou can verify the issue by looking for files with\nlocal\nin the prefix in\n/var/lib/ganglia/rrds\n. Generally, this directory should only have files prefixed with\napplication-\n.\nFor example:\n%sh ls -ltrhR /var/lib/ganglia/rrds/  | grep -i local\r\n\r\nrw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453624916.driver.Databricks.directoryCommit.markerReadErrors.count.rrd -rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453614595.driver.Databricks.directoryCommit.deletedFilesFiltered.count.rrd -rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453614595.driver.Databricks.directoryCommit.autoVacuumCount.count.rrd -rw-rw-rw- 1 ganglia ganglia 616K Jun 29 18:00 local-1593453605184.driver.CodeGenerator.generatedMethodSize.min.rrd\nCause\nGanglia metrics typically use less than 10GB of disk space. However, under certain circumstances, a “data explosion” can occur, which causes the root partition to fill with Ganglia metrics. Data explosions also create a dirty cache. When this happens, the Ganglia metrics can consume more than 100GB of disk space on root.\nThis “data explosion” can happen if you define the spark session variable as global in your Python file and then call functions defined in the same file to perform Apache Spark transformation on data. 
When this happens, the Spark session logic can be serialized, along with the required function definition, resulting in a Spark session being created on the worker node.\nFor example, take the following Spark session definition:\n%python\r\n\r\nfrom pyspark.sql import SparkSession\r\n\r\ndef get_spark():\r\n    \"\"\"Returns a spark session.\"\"\"\r\n    return SparkSession.builder.getOrCreate()\r\n\r\nif \"spark\" not in globals():\r\n  spark = get_spark()\r\n\r\ndef generator(partition):\r\n    print(globals()['spark'])\r\n    for row in partition:\r\n        yield [word.lower() for word in row[\"value\"]]\nIf you use the following example commands,\nlocal\nprefixed files are created:\n%python\r\n\r\nfrom repro import ganglia_test\r\ndf = spark.createDataFrame([([\"Hello\"], ), ([\"Spark\"], )], [\"value\"])\r\ndf.rdd.mapPartitions(ganglia_test.generator).toDF([\"value\"]).show()\nThe\nprint(globals()['spark'])\nstatement in the\ngenerator()\nfunction doesn’t result in an error, because it is available as a global variable in the worker nodes. It may fail with an invalid key error in some cases, as that value is not available as a global variable. Streaming jobs that execute on short batch intervals are susceptible to this issue.\nSolution\nEnsure that you are not using\nSparkSession.builder.getOrCreate()\nto define a Spark session as a global variable.\nWhen you troubleshoot, you can use the timestamps on files with the local prefix to help determine when a problematic change was first introduced." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/slowness-when-fetching-results-in-databricks-sql.json b/scraped_kb_articles/slowness-when-fetching-results-in-databricks-sql.json new file mode 100644 index 0000000000000000000000000000000000000000..1d00ec5b63b3638a8e29fbd20689753e8461295e --- /dev/null +++ b/scraped_kb_articles/slowness-when-fetching-results-in-databricks-sql.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/slowness-when-fetching-results-in-databricks-sql", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDatabricks SQL uses cloud fetch to increase query performance. This is done by default.\nInstead of using single threaded queries, cloud fetch retrieves data in parallel from cloud storage buckets (such as AWS S3 and Azure Data Lake Storage). Compared to a standard, single threaded fetch, you can see up to a 10X increase in performance using cloud fetch.\nIf you are seeing slowness when fetching results in Databricks SQL it is likely that cloud fetch is disabled.\nThe following symptoms indicate an issue with cloud fetch:\nSlowness when retrieving results over ODBC/JDBC\nYour BI tools frequently get fetch time-outs while waiting for query results\nThe SQL warehouse query editor is slow\nCauses\nSome common issues that can result in cloud fetch being disabled:\nUsing an ODBC driver version below 2.6.17\nUsing a JDBC driver version below 2.6.18\nFirewall/ACL issues between your workspace and cloud storage\nCloud provider versioning is enabled on the cloud storage you are using\nSolution\nEnsure you are using a\nDatabricks ODBC driver\nversion 2.6.17 or above.\nEnsure you are using a\nDatabricks JDBC driver\nversion 2.6.18 or above.\nEnsure your ODBC/JDBC\nAuthentication requirements\n(\nAWS\n|\nAzure\n|\nGCP\n) are properly configured.\nDisable storage bucket versioning on the cloud storage you are using to store your data." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/slowness-when-using-the-foundational-model-api-with-pay-per-token-mode.json b/scraped_kb_articles/slowness-when-using-the-foundational-model-api-with-pay-per-token-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..e3b3e88b15a222a77a088a2087b66d7d98a01f0d --- /dev/null +++ b/scraped_kb_articles/slowness-when-using-the-foundational-model-api-with-pay-per-token-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/slowness-when-using-the-foundational-model-api-with-pay-per-token-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you access Databricks Foundation Model APIs in pay-per-token mode, you notice extended response times and reduced model inference operation efficiency, particularly with first-token generation.\nCause\nThe Foundation Model APIs pay-per-token mode is a multi-tenant service. When multiple customers send requests with long contexts, they consume most of the available GPU resources. As a result, the first token latency can be significantly delayed.\nThis mode is not designed for high-throughput applications or performant production workloads.\nSolution\nFor production workloads requiring:\nHigh throughput\nPerformance guarantees\nFine-tuned models\nEnhanced security requirements\nDatabricks recommends choosing the provisioned throughput mode instead. It is specifically designed to meet these production-grade requirements.\nFor more information, please refer to the\nDatabricks Foundation Model APIs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/sort-failed-after-writing-partitioned-data-to-parquet-using-pyspark-on-databricks-runtime-133-lts.json b/scraped_kb_articles/sort-failed-after-writing-partitioned-data-to-parquet-using-pyspark-on-databricks-runtime-133-lts.json new file mode 100644 index 0000000000000000000000000000000000000000..31a8c869fbaa72a0e69456dfa52d56ed45df66bd --- /dev/null +++ b/scraped_kb_articles/sort-failed-after-writing-partitioned-data-to-parquet-using-pyspark-on-databricks-runtime-133-lts.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/sort-failed-after-writing-partitioned-data-to-parquet-using-pyspark-on-databricks-runtime-133-lts", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn Databricks Runtime 13.3 LTS to 15.3, when using\nsortWithinPartitions\nto make sure the rows in each partition are ordered based on the columns, the sorted data frame looks correct when displayed, but after saving and reading it back, the sorting is lost.\nCause\nThere is an issue in which the planned write local sort comes after the\nsortWithinPartitions\nlocal sort, and then\nEliminateSorts\ndrops the first sort as unnecessary. 
This behavior occurs with or without Photon.\nSolution\nThis issue is fixed in Databricks Runtime 15.4 LTS.\nIf upgrading is not an option, set the below Apache Spark configuration as a workaround.\nspark.conf.set(\"spark.sql.optimizer.plannedWrite.enabled\", \"false\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-default-perms-adls-gen1.json b/scraped_kb_articles/spark-default-perms-adls-gen1.json new file mode 100644 index 0000000000000000000000000000000000000000..3565feabf77e9526c40c04fbdaaff3da1f5c1592 --- /dev/null +++ b/scraped_kb_articles/spark-default-perms-adls-gen1.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/spark-default-perms-adls-gen1", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are using Azure Databricks and have a Spark job that is writing to ADLS Gen1 storage.\nWhen you try to manually read, write, or delete data in the folders you get an error message.\nForbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation\nCause\nWhen writing data to ADLS Gen1 storage, Apache Spark uses the service principal as the owner of the files it creates. The service principal is defined in\ndfs.adls.oauth2.client.id\n.\nWhen files are created, they inherit the default permissions from the Hadoop filesystem. The Hadoop filesystem has a default permission of 666 (-rw-rw-rw-) and a default umask of 022, which results in the 644 permission setting as the default for files.\nWhen folders are created, they inherit the parent folder permissions, which are 770 by default.\nBecause the owner is the service principal and not the user, you don’t have permission to access the folder due to the 0 bit in the folder permissions.\nSolution\nOption 1\nMake the service principal user part of the same group as the default user. 
This will allow access when accessing storage through the portal.\nPlease reach out to Microsoft support for assistance.\nOption 2\nCreate a base folder in ADLS Gen1 and set the permissions to 777. Write Spark output under this folder. Because folders created by Spark inherit the parent folder permissions, all folders created by Spark will have 777 permissions. This allows any user to access the folders.\nOption 3\nChange the default umask from 022 to 000 on your Azure Databricks clusters.\nSet\nspark.hadoop.fs.permissions.umask-mode 000\nin the\nSpark config\nfor your cluster.\nWith a umask of 000, the default Hadoop filesystem permission of 666 becomes the default permission used when Azure Databricks creates objects." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-executor-memory.json b/scraped_kb_articles/spark-executor-memory.json new file mode 100644 index 0000000000000000000000000000000000000000..716679fde9232e0cb591b15f821068220811a99d --- /dev/null +++ b/scraped_kb_articles/spark-executor-memory.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/spark-executor-memory", + "title": "Título do Artigo Desconhecido", + "content": "By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the\nspark.executor.memory\nproperty.\nHowever, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collector become apparent. These issues can be resolved by limiting the amount of memory under garbage collector management.\nSelected Databricks cluster types enable the off-heap mode, which limits the amount of memory under garbage collector management. 
This is why certain Spark clusters have the\nspark.executor.memory\nvalue set to a fraction of the overall cluster memory.\nThe off-heap mode is controlled by the properties\nspark.memory.offHeap.enabled\nand\nspark.memory.offHeap.size\nwhich are available in Spark 1.6.0 and above.\nAWS\nThe following Databricks cluster types enable the off-heap memory policy:\nc5d.18xlarge\nc5d.9xlarge\ni3.16xlarge\ni3en.12xlarge\ni3en.24xlarge\ni3en.2xlarge\ni3en.3xlarge\ni3en.6xlarge\ni3en.large\ni3en.xlarge\nm4.16xlarge\nm5.24xlarge\nm5a.12xlarge\nm5a.16xlarge\nm5a.24xlarge\nm5a.8xlarge\nm5d.12xlarge\nm5d.24xlarge\nm5d.4xlarge\nr4.16xlarge\nr5.12xlarge\nr5.16xlarge\nr5.24xlarge\nr5.2xlarge\nr5.4xlarge\nr5.8xlarge\nr5a.12xlarge\nr5a.16xlarge\nr5a.24xlarge\nr5a.2xlarge\nr5a.4xlarge\nr5a.8xlarge\nr5d.12xlarge\nr5d.24xlarge\nr5d.2xlarge\nr5d.4xlarge\nz1d.2xlarge\nz1d.3xlarge\nz1d.6xlarge\nAzure\nThe following Azure Databricks cluster types enable the off-heap memory policy:\nStandard_L8s_v2\nStandard_L16s_v2\nStandard_L32s_v2\nStandard_L64s_v2\nStandard_L80s_v2" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-image-download-failure-error-message.json b/scraped_kb_articles/spark-image-download-failure-error-message.json new file mode 100644 index 0000000000000000000000000000000000000000..ad633972635d5f930cf3a5480464bb0f34cddac7 --- /dev/null +++ b/scraped_kb_articles/spark-image-download-failure-error-message.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/spark-image-download-failure-error-message", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYour all-purpose clusters are failing to launch, and your jobs are failing to run.
You are seeing error messages related to the Apache Spark image not downloading or not existing.\nSpark image download failure.\nSpark image failed to download or does not exist.\nCause\nA\nSpark image download failure\nerror usually indicates network latency or configuration problems that are blocking traffic to the storage account hosting the Spark image.\nSolution\nCheck your Databricks VNet configuration to make sure traffic to your storage is not blocked by any of the following:\nDNS\n- By default, resources deployed to a VNet use Azure DNS for domain name resolution. If you are using a custom DNS server, you must configure your custom DNS to forward these requests to the Azure recursive resolver (\n168.63.129.16\n) to resolve the IP addresses for Azure artifacts. Review the\nConfigure custom DNS\ndocumentation for more information.\nYou can check your DNS settings in the Azure portal. From there, navigate to your Databricks VNet and select \"DNS servers\" from the \"Settings\" menu. Then you can add Azure recursive resolver IP address to the list of DNS servers as in screenshot below.\nFirewall\n- If you have a firewall enabled on the VNet, review the settings and ensure it is not blocking traffic to storage.\nYou can check your firewall settings in the Azure portal. From there, navigate to your Databricks VNet and select\n\"Firewall\" from the \"Settings\" menu.\nIf firewall rules have been created, you can view them there.\nNetwork Security Group (NSG)\n- Verify the network security group includes all required\nNSG rules\n.\nYou can check your NSG rules settings by going to the Azure portal and navigating to your VNet. From there, select \"Subnets\" from the \"Settings\" menu, and choose the security group for both subnets. 
Double check both inbound and outbound rules and confirm all required ones are added and traffic to Storage is not blocked.\nUser-defined routes and service endpoints\n- Verify that the route table includes all required\nuser-defined routes\n. If you use service endpoints rather than user-defined routes for Blob storage, check those endpoints as well.\nYou can check your UDR settings by going to the Azure portal and navigating to your VNet. From there, select \"Subnets\" from the \"Settings\" menu, and choose the Route table for both subnets. Double check the routes and confirm all required ones are added and traffic to Storage is not blocked.\nFor service endpoints, from the VNet page, select \"Service endpoints\" from the \"Settings\" menu.\nIf service endpoints have been created, you can view them there.\nAdditionally, as a best practice, consider setting up a\nDisaster recovery\nsolution to minimize the impact of any service issues. Ensure the right people in your organization are notified about any service issues by configuring\nAzure Service Health alerts\n. These alerts can trigger emails, SMS, push notifications, webhooks, and more." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-jar-job-error.json b/scraped_kb_articles/spark-jar-job-error.json new file mode 100644 index 0000000000000000000000000000000000000000..c380d11f0aafaae52c5aece6d4ea0c0b5caeb13e --- /dev/null +++ b/scraped_kb_articles/spark-jar-job-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/spark-jar-job-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIf you run multiple Apache Spark JAR jobs concurrently, some of the runs might fail with the error:\norg.apache.spark.sql.AnalysisException: Table or view not found: xxxxxxx; line 1 pos 48\nCause\nThis error occurs due to a bug in Scala. When an object extends\nApp\n, its\nval\nfields are no longer immutable and they can be changed when the\nmain\nmethod is called.
If you run JAR jobs multiple times, a\nval\nfield containing a DataFrame can be changed inadvertently.\nAs a result, when any one of the concurrent runs finishes, it wipes out the temporary views of the other runs.\nScala issue 11576\nprovides more detail.\nSolution\nTo work around this bug, call the\nmain()\nmethod explicitly. As an example, if you have code similar to this:\n%scala\r\n\r\n  object MainTest extends App {\r\n    ...\r\n  }\nYou can replace it with code that does not extend\nApp\n:\n%scala\r\n\r\n  object MainTest {\r\n    def main(args: Array[String]) {\r\n    ......\r\n    }\r\n  }" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-job-fail-parquet-column-convert.json b/scraped_kb_articles/spark-job-fail-parquet-column-convert.json new file mode 100644 index 0000000000000000000000000000000000000000..5838a146f64ff67467dc750127c6f5a4fdfbb5b4 --- /dev/null +++ b/scraped_kb_articles/spark-job-fail-parquet-column-convert.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/spark-job-fail-parquet-column-convert", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are reading data in Parquet format and writing to a Delta table when you get a\nParquet column cannot be converted\nerror message.\nThe cluster is running Databricks Runtime 7.3 LTS or above.\norg.apache.spark.SparkException: Task failed while writing rows.\r\nCaused by: com.databricks.sql.io.FileReadException: Error while reading file s3://bucket-name/landing/edw/xxx/part-xxxx-tid-c00.snappy.parquet. Parquet column cannot be converted. Column: [Col1], Expected: DecimalType(10,0), Found: FIXED_LEN_BYTE_ARRAY\r\n\r\nCaused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException.\nCause\nThe vectorized Parquet reader is decoding the decimal type column to a binary format.\nThe vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and above for reading datasets in Parquet files. 
The read schema uses atomic data types: binary, boolean, date, string, and timestamp.\nInfo\nThis error only occurs if you have decimal type columns in the source data.\nSolution\nIf you have decimal type columns in your source data, you should disable the vectorized Parquet reader.\nSet\nspark.sql.parquet.enableVectorizedReader\nto\nfalse\nin the cluster’s Spark configuration to disable the vectorized Parquet reader at the cluster level.\nYou can also disable the vectorized Parquet reader at the notebook level by running:\n%scala\r\n\r\nspark.conf.set(\"spark.sql.parquet.enableVectorizedReader\",\"false\")\nInfo\nThe vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact. You should only disable it if you have decimal type columns in your source data." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-job-not-starting.json b/scraped_kb_articles/spark-job-not-starting.json new file mode 100644 index 0000000000000000000000000000000000000000..3df779f6519dd0c61e68dbd7de9539260b96bec7 --- /dev/null +++ b/scraped_kb_articles/spark-job-not-starting.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/spark-job-not-starting", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nNo Spark jobs start, and the driver logs contain the following error:\nInitial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\nCause\nThis error can occur when the executor memory and number of executor cores are set explicitly on the\nSpark Config\ntab.\nHere is a sample config:\nAWS\nIn this example, the executor is set to an\ni3.xlarge\nnode, and the\nSpark Config\nis set to:\nspark.executor.cores 5\r\nspark.executor.memory 6G\nThe\ni3.xlarge\ncluster type only has 4 cores but a
user has set 5 cores per executor explicitly. Spark does not start any tasks, and enters the following error messages into the driver logs:\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\nGCP\nIn this example, the executor is set to an\nn1-standard-4\nnode, and the\nSpark Config\nis set to:\nspark.executor.cores 5\r\nspark.executor.memory 6G\nThe\nn1-standard-4\ncluster type only has 4 cores but a user has set 5 cores per executor explicitly. Spark does not start any tasks, and enters the following error messages into the driver logs:\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\r\nWARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\nSolution\nYou should never specify a number of executor cores greater than the number of cores available on the node that you chose for the cluster."
+} \ No newline at end of file diff --git a/scraped_kb_articles/spark-metrics.json b/scraped_kb_articles/spark-metrics.json new file mode 100644 index 0000000000000000000000000000000000000000..3b6802f8179410d3975db6a2e288cff1dbce818e --- /dev/null +++ b/scraped_kb_articles/spark-metrics.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/metrics/spark-metrics", + "title": "Título do Artigo Desconhecido", + "content": "This article gives an example of how to monitor Apache Spark components using the\nSpark configurable metrics system\n. Specifically, it shows how to set a new source and enable a sink.\nFor detailed information about the Spark components available for metrics collection, including sinks supported out of the box, follow the documentation link above.\nDelete\nInfo\nThere are several other ways to collect metrics to get insight into how a Spark job is performing, which are also not covered in this article:\nSparkStatusTracker (\nSource\n,\nAPI\n): monitor job, stage, or task progress\nStreamingQueryListener (\nSource\n,\nAPI\n): intercept streaming events\nSparkListener (\nSource\n): intercept events from Spark scheduler\nFor information about using other third-party tools to monitor Spark jobs in Databricks, see Monitor performance (\nAWS\n|\nAzure\n).\nHow does this metrics collection system work? Upon instantiation, each executor creates a connection to the driver to pass the metrics.\nThe first step is to write a class that extends the\nSource\ntrait:\n%scala\r\n\r\nclass MySource extends Source {\r\n  override val sourceName: String = \"MySource\"\r\n\r\n  override val metricRegistry: MetricRegistry = new MetricRegistry\r\n\r\n  val FOO: Histogram = metricRegistry.histogram(MetricRegistry.name(\"fooHistory\"))\r\n  val FOO_COUNTER: Counter = metricRegistry.counter(MetricRegistry.name(\"fooCounter\"))\r\n}\nThe next step is to enable the sink. 
In this example, the metrics are printed to the console:\n%scala\r\n\r\nval spark: SparkSession = SparkSession\r\n    .builder\r\n    .master(\"local[*]\")\r\n    .appName(\"MySourceDemo\")\r\n    .config(\"spark.driver.host\", \"localhost\")\r\n    .config(\"spark.metrics.conf.*.sink.console.class\", \"org.apache.spark.metrics.sink.ConsoleSink\")\r\n.getOrCreate()\nDelete\nInfo\nTo sink metrics to Prometheus, you can use this third-party library:\nhttps://github.com/banzaicloud/spark-metrics\n.\nThe last step is to instantiate the source and register it with SparkEnv:\n%scala\r\n\r\nval source: MySource = new MySource\r\nSparkEnv.get.metricsSystem.registerSource(source)\nYou can view a complete, buildable example at\nhttps://github.com/newroyker/meter\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-ml-to-onnx-model-conversion-does-not-produce-the-same-model-predictions-differ.json b/scraped_kb_articles/spark-ml-to-onnx-model-conversion-does-not-produce-the-same-model-predictions-differ.json new file mode 100644 index 0000000000000000000000000000000000000000..f0f2c63cc9a2d9777cc20531783e1b2bf75c1d11 --- /dev/null +++ b/scraped_kb_articles/spark-ml-to-onnx-model-conversion-does-not-produce-the-same-model-predictions-differ.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/spark-ml-to-onnx-model-conversion-does-not-produce-the-same-model-predictions-differ", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have successfully converted your Apache Spark ML model or Spark ML pipeline to the ONNX format in a notebook, and although no errors were raised, you noticed that the predictions produced by the ONNX model differ from those produced by the original model on the same input dataset.\nImportant\nMLeap is no longer available as of Databricks Runtime 15.1 ML and above. Databricks recommends using the ONNX format to package models for deployment on JVM-based frameworks. 
For more information please review the\nDatabricks Runtime 15.1 for Machine Learning (EoS)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCause\nA\ntarget_opset\nvalue was not provided when calling the\nconvert_sparkml\nfunction, which produces the ONNX model.\nContext\nThe\ntarget_opset\nis a parameter used during the conversion of machine learning models to the ONNX format, specifying the\nopset\nversion that the converter should use when translating the model's operations into ONNX.\nWhen converting a model from a framework like SparkML to ONNX, each operation in the model (for example, a decision tree split or activation function) must be mapped to its ONNX equivalent. The\ntarget_opset\ndictates which version of these operator definitions the converter should adhere to during this mapping.\nSolution\nDefine the\nTARGET_OPSET\nand then pass it as the\ntarget_opset\nparameter of the\nconvert_sparkml\nfunction. This ensures that the converted model is equivalent to its original model, and the model’s operator versions are defined accordingly.\nfrom onnx.defs import onnx_opset_version\r\nfrom onnxconverter_common.onnx_ex import DEFAULT_OPSET_NUMBER\r\nfrom onnxmltools.convert.common.data_types import FloatTensorType, Int64TensorType\r\n\r\nTARGET_OPSET = min(DEFAULT_OPSET_NUMBER, onnx_opset_version())\r\n\r\nspark.conf.set(\"ONNX_DFS_PATH\", \"file:///dbfs/onnx_tmp\")\r\n\r\nonx_r_forest_model = onnxmltools.convert_sparkml(\r\nmodel=spark_model_object,\r\nname=\"onnx_model\",\r\ninitial_types=[(\"features\", FloatTensorType([None, 3]))],\r\nspark_session=spark, \r\ntarget_opset=TARGET_OPSET)" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-overwrite-cancel.json b/scraped_kb_articles/spark-overwrite-cancel.json new file mode 100644 index 0000000000000000000000000000000000000000..54849df3c83086c6a68c6963a99fb088baa497a1 --- /dev/null +++ b/scraped_kb_articles/spark-overwrite-cancel.json @@ -0,0 +1,5 @@ +{ + "url": 
"https://kb.databricks.com/en_US/jobs/spark-overwrite-cancel", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs:\nError: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.` testtable`').\r\nThe associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already exists.;\nCause\nThis problem is due to a change in the default behavior of Spark in version 2.4.\nThis problem can occur if:\nThe cluster is terminated while a write operation is in progress.\nA temporary network issue occurs.\nThe job is interrupted.\nOnce the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. Basically, the problem is that a metadata directory called\n_STARTED\nisn’t deleted automatically when Databricks tries to overwrite it.\nYou can reproduce the problem by following these steps:\nCreate a DataFrame:\nval df = spark.range(1000)\nWrite the DataFrame to a location in overwrite mode:\ndf.write.mode(SaveMode.Overwrite).saveAsTable(\"testdb.testtable\")\nCancel the command while it is executing.\nRe-run the\nwrite\ncommand.\nSolution\nSet the flag\nspark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation\nto\ntrue\n. This flag deletes the\n_STARTED\ndirectory and returns the process to the original state. For example, you can set it in the notebook:\n%python\r\n\r\nspark.conf.set(\"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation\",\"true\")\nOr you can set it in the cluster level\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n):\nspark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true\nAnother option is to manually clean up the data directory specified in the error message. 
You can do this with\ndbutils.fs.rm\n.\n%scala\r\n\r\ndbutils.fs.rm(\"\", true)" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-serialized-task-is-too-large.json b/scraped_kb_articles/spark-serialized-task-is-too-large.json new file mode 100644 index 0000000000000000000000000000000000000000..10e8158153f9faf2c6dbbeb4d98928f810a000ee --- /dev/null +++ b/scraped_kb_articles/spark-serialized-task-is-too-large.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/execution/spark-serialized-task-is-too-large", + "title": "Título do Artigo Desconhecido", + "content": "If you see the following error message, you may be able to fix this error by changing the\nSpark config\n(\nAWS\n|\nAzure\n) when you start the cluster.\nSerialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.rpc.message.maxSize (XXX bytes).\r\nConsider increasing spark.rpc.message.maxSize or using broadcast variables for large values.\nTo change the\nSpark config\n, set the property:\nspark.rpc.message.maxSize\nWhile tuning the configuration is one option, typically this error message means that you send some large objects from the driver to executors, e.g., call\nparallelize\nwith a large list, or convert a large R DataFrame to a Spark DataFrame.\nIf so, we recommend first auditing your code to remove large objects that you use, or leverage broadcast variables instead. If that does not resolve this error, you can increase the partition number to split the large list to multiple small ones to reduce the Spark RPC message size.\nHere are examples for Python and Scala:\nPython\nlargeList = [...] # This is a large list\r\npartitionNum = 100 # Increase this number if necessary\r\nrdd = sc.parallelize(largeList, partitionNum)\r\ndf = rdd.map(lambda x: (x,)).toDF() # Wrap each element in a tuple so the RDD can be converted to a DataFrame\nScala\nval largeList = Seq(...)
// This is a large list\r\nval partitionNum = 100 // Increase this number if necessary\r\nval rdd = sc.parallelize(largeList, partitionNum)\r\nval ds = rdd.toDS()\nR users need to increase the Spark configuration\nspark.default.parallelism\nto increase the partition number at cluster initialization. You cannot set this configuration after cluster creation." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-shows-less-memory.json b/scraped_kb_articles/spark-shows-less-memory.json new file mode 100644 index 0000000000000000000000000000000000000000..1104fd361facd1383b3dac6b405ebfd51e7b866e --- /dev/null +++ b/scraped_kb_articles/spark-shows-less-memory.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/spark-shows-less-memory", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe\nExecutors\ntab in the Spark UI shows less memory than is actually available on the node:\nAWS\nAn\nm4.xlarge\ninstance (16 GB RAM, 4 cores) for the driver node, shows 4.5 GB memory on the\nExecutors\ntab.\nAn\nm4.large\ninstance (8 GB RAM, 2 cores) for the driver node, shows 710 MB memory on the\nExecutors\ntab:\nAzure\nAn F8s instance (16 GB, 4 cores) for the driver node, shows 4.5 GB of memory on the\nExecutors\ntab.\nAn F4s instance (8 GB, 4 cores) for the driver node, shows 710 MB of memory on the\nExecutors\ntab:\nCause\nThe total amount of memory shown is less than the memory on the cluster because some memory is occupied by the kernel and node-level services.\nSolution\nTo calculate the available amount of memory, you can use the formula used for executor memory allocation\n(all_memory_size * 0.97 - 4800MB) * 0.8\n, where:\n0.97 accounts for kernel overhead.\n4800 MB accounts for internal node-level services (node daemon, log daemon, and so on).\n0.8 is a heuristic to ensure the LXC container running the Spark process doesn’t crash due to out-of-memory errors.\nTotal available memory for storage on an
instance is\n(8192MB * 0.97 - 4800MB) * 0.8 - 1024\n= 1.2 GB. Because the parameter\nspark.memory.fraction\nis by default 0.6, approximately\n(1.2 * 0.6)\n= ~710 MB is available for storage.\nYou can change the\nspark.memory.fraction\nSpark configuration (\nAWS\n|\nAzure\n) to adjust this parameter. Calculate the available memory for a new parameter as follows:\nIf you use an instance, which has 8192 MB memory, it has available memory 1.2 GB.\nIf you specify a\nspark.memory.fraction\nof 0.8, the\nExecutors\ntab in the Spark UI should show:\n(1.2 * 0.8)\nGB = ~960 MB." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-submit-fail-parse-byte-string.json b/scraped_kb_articles/spark-submit-fail-parse-byte-string.json new file mode 100644 index 0000000000000000000000000000000000000000..5519ea2c8898d838c9019f2576b83135a791d6d4 --- /dev/null +++ b/scraped_kb_articles/spark-submit-fail-parse-byte-string.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/spark-submit-fail-parse-byte-string", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nSpark-submit jobs fail with a\nFailed to parse byte string: -1\nerror message.\njava.util.concurrent.ExecutionException: java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.\r\nFailed to parse byte string: -1\r\nat java.util.concurrent.FutureTask.report(FutureTask.java:122)\r\nat java.util.concurrent.FutureTask.get(FutureTask.java:206)\r\nat org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:182)\r\n... 108 more\r\nCaused by: java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 
50b, 100k, or 250m.\r\nFailed to parse byte string: -1\nCause\nThe value of the\nspark.driver.maxResultSize\napplication property is negative.\nSolution\nThe value assigned to\nspark.driver.maxResultSize\ndefines the maximum size (in bytes) of the serialized results for each Spark action. You can assign a positive value to the\nspark.driver.maxResultSize\nproperty to define a specific size. You can also assign a value of 0 to define an unlimited maximum size. You cannot assign a negative value to this property.\nIf the total size of a job is above the\nspark.driver.maxResultSize\nvalue, the job is aborted.\nYou should be careful when setting an excessively high (or unlimited) value for\nspark.driver.maxResultSize\n. A high limit can cause out-of-memory errors in the driver if the\nspark.driver.memory\nproperty is not set high enough.\nSee\nSpark Configuration Application Properties\nfor more details." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-udf-performance.json b/scraped_kb_articles/spark-udf-performance.json new file mode 100644 index 0000000000000000000000000000000000000000..0975afccec65900a67becb90cc60ee6ad24f69c5 --- /dev/null +++ b/scraped_kb_articles/spark-udf-performance.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/spark-udf-performance", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nSometimes Apache Spark jobs hang indefinitely due to the non-deterministic behavior of a Spark User-Defined Function (UDF). 
Here is an example of such a function:\n%scala\r\n\r\nval convertorUDF = (commentCol: String) =>\r\n    {\r\n              // UDF definition\r\n    }\r\nval translateColumn = udf(convertorUDF)\nIf you call this UDF using the\nwithColumn()\nAPI and then apply some filter transformation on the resulting\nDataFrame\n, the UDF could potentially execute multiple times for each record, affecting application performance.\n%scala\r\n\r\nval translatedDF = df.withColumn(\"translatedColumn\", translateColumn( df(\"columnToTranslate\")))\r\nval filteredDF = translatedDF.filter(!translatedDF(\"translatedColumn\").contains(\"Invalid URL Provided\") && !translatedDF(\"translatedColumn\").contains(\"Unable to connect to Microsoft API\"))\nCause\nSometimes a deterministic UDF can behave nondeterministically, performing duplicate invocations depending on the definition of the UDF. You often see this behavior when you use a UDF on a DataFrame to add an additional column using the\nwithColumn()\nAPI, and then apply a transformation (filter) to the resulting\nDataFrame\n.\nSolution\nUDFs must be deterministic. Due to optimization, duplicate invocations might be eliminated or the function can be invoked more times than it is present in the query.\nThe better option is to cache the\nDataFrame\nwhere you are using the UDF.
If the\nDataFrame\ncontains a large amount of data, then writing it to a Parquet format file is optimal.\nYou can use the following code to cache the result:\n%scala\r\n\r\nval translatedDF = df.withColumn(\"translatedColumn\", translateColumn( df(\"columnToTranslate\"))).cache()" +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-ui-is-empty-for-the-job-clusters-after-termination.json b/scraped_kb_articles/spark-ui-is-empty-for-the-job-clusters-after-termination.json new file mode 100644 index 0000000000000000000000000000000000000000..419661756afcf50a668165b0214aa144c54427e0 --- /dev/null +++ b/scraped_kb_articles/spark-ui-is-empty-for-the-job-clusters-after-termination.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/spark-ui-is-empty-for-the-job-clusters-after-termination", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have jobs executing Python/SQL commands, but you're unable to view the Apache Spark UI details on the associated clusters. Specifically, the Spark UI does not display any information under the\nJobs\n,\nStages\n, or\nStorage\ntabs.\nCause\nThis is the expected behavior when executing non-Spark operations or API based calls that do not involve a Spark execution context.\nThe Spark UI only displays details when a Spark job—such as a PySpark or Spark SQL operation—is executed within an active Spark session. It helps you monitor, debug, and optimize Spark jobs by offering real-time and historical views of job execution.\nSolution\nThe Spark UI provides insights into Spark job execution, including job progress, stages, and resource utilization. 
However, it only displays information when Spark operations are executed.\nThe Spark UI populates details when an active Spark session runs:\nRDD transformations\nSpark SQL queries\nDataFrame operations\nFor example, the following PySpark code triggers a Spark job, which appears in the Spark UI.\n%python\r\n\r\n# Create a simple DataFrame\r\ndf = spark.createDataFrame([(1, \"Alice\"), (2, \"Bob\")], [\"id\", \"name\"])\r\ndf.show()\nIf no Spark jobs are triggered, the Spark UI does not display any details.\nFor example, the following pure Python code does not interact with Spark and does not appear in the UI:\n%python\r\n\r\ndata = [(1, \"Alice\"), (2, \"Bob\")]\r\n# Process data using pure Python\r\nprocessed_data = [f\"ID: {id}, Name: {name}\" for id, name in data]\r\n\r\n# Print the result\r\nfor item in processed_data:\r\n   print(item)\nSpark SQL queries remain visible in the\nSQL\n/\nDataFrame\ntab in the cluster’s Spark UI, even after the compute is terminated.\nFor example, this SQL code appears in the SQL tab after it has been run, even if the compute is terminated.\n%sql\r\n\r\nSELECT\r\n   id,\r\n   name,\r\n   age,\r\n   age + 5 AS age_after_5_years\r\nFROM default.users;\nFor more information on Spark UI components, review the Spark\nWeb UI\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/spark-ui-not-in-sync-with-job.json b/scraped_kb_articles/spark-ui-not-in-sync-with-job.json new file mode 100644 index 0000000000000000000000000000000000000000..b38d2fcd7ff140bec901d3e5b96a4c9355b63eaf --- /dev/null +++ b/scraped_kb_articles/spark-ui-not-in-sync-with-job.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/spark-ui-not-in-sync-with-job", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe status of your Spark jobs is not correctly shown in the Spark UI (\nAWS\n|\nAzure\n|\nGCP\n). 
Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank.\nWhen you review the driver logs, you see an AsyncEventQueue warning.\nLogs\r\n=====\r\n20/12/23 21:20:26 WARN AsyncEventQueue: Dropped 93909 events from shared since Wed Dec 23 21:19:26 UTC 2020. \r\n20/12/23 21:21:26 WARN AsyncEventQueue: Dropped 52354 events from shared since Wed Dec 23 21:20:26 UTC 2020. \r\n20/12/23 21:22:26 WARN AsyncEventQueue: Dropped 94137 events from shared since Wed Dec 23 21:21:26 UTC 2020. \r\n20/12/23 21:23:26 WARN AsyncEventQueue: Dropped 44245 events from shared since Wed Dec 23 21:22:26 UTC 2020. \r\n20/12/23 21:24:26 WARN AsyncEventQueue: Dropped 126763 events from shared since Wed Dec 23 21:23:26 UTC 2020.\r\n20/12/23 21:25:26 WARN AsyncEventQueue: Dropped 94156 events from shared since Wed Dec 23 21:24:26 UTC 2020.\nDelete\nInfo\nThis is related to the\nApache Spark UI shows wrong number of jobs\nKB article.\nCause\nAll Spark jobs, stages, and tasks are pushed to the event queue.\nThe backend listener reads the Spark UI events from this queue and renders the Spark UI.\nThe default capacity of the event queue (\nspark.scheduler.listenerbus.eventqueue.capacity\n) is 20000.\nIf more events are pushed to the event queue than the backend listener can consume, the oldest events get dropped from the queue and the listener never consumes them.\nThese events are lost and do not get rendered in the Spark UI.\nSolution\nSet the value of\nspark.scheduler.listenerbus.eventqueue.capacity\nin your cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n) at cluster level to a value greater than 20000.\nThis value sets the capacity for the app status event queue, which holds events for internal application status listeners. Increasing this value allows the event queue to hold a larger number of events, but may result in the driver using more memory." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/spark-ui-wrong-number-jobs.json b/scraped_kb_articles/spark-ui-wrong-number-jobs.json new file mode 100644 index 0000000000000000000000000000000000000000..23f16d0d296fdda7a01e6c59c591bec5d2e4f354 --- /dev/null +++ b/scraped_kb_articles/spark-ui-wrong-number-jobs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/spark-ui-wrong-number-jobs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are reviewing the number of active Apache Spark jobs on a cluster in the Spark UI, but the number is too high to be accurate.\nIf you restart the cluster, the number of jobs shown in the Spark UI is correct at first, but over time it grows abnormally high.\nCause\nThe Spark UI is not always accurate for large, or long-running, clusters due to event drops. The Spark UI requires termination entries to know when an active job has completed. If a job misses this entry, due to errors or unexpected failure, the job may stop running while incorrectly showing as active in the Spark UI.\nInfo\nFor more information review the\nApache Spark UI is not in sync with job\nKB article.\nSolution\nYou should not use the Spark UI as a source of truth for active jobs on a cluster.\nThe method\nsc.statusTracker().getActiveJobIds()\nin the Spark API is a reliable way to track the number of active jobs.\nPlease review the\nSpark Status Tracker\ndocumentation for more information." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/sparkexception-error-when-trying-to-use-an-apache-spark-udf-to-create-and-dynamically-pass-a-prompt-to-the-ai_query-function.json b/scraped_kb_articles/sparkexception-error-when-trying-to-use-an-apache-spark-udf-to-create-and-dynamically-pass-a-prompt-to-the-ai_query-function.json new file mode 100644 index 0000000000000000000000000000000000000000..83b575c4aa7b11baccae2ad5dcec519c757cecf4 --- /dev/null +++ b/scraped_kb_articles/sparkexception-error-when-trying-to-use-an-apache-spark-udf-to-create-and-dynamically-pass-a-prompt-to-the-ai_query-function.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/sparkexception-error-when-trying-to-use-an-apache-spark-udf-to-create-and-dynamically-pass-a-prompt-to-the-ai_query-function", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using a legacy Apache Spark user-defined function (UDF) to create a complex prompt and pass it dynamically to the\nai_query()\nfunction, you receive an error. However, if you don’t use the UDF, the\nai_query()\nfunction works.\nExample prompt\nIn the following code, the legacy Spark UDF\ncreate_prompt_udf()\nand the Python UDF\nai_query()\nare called in the same transformation.\nresult_df = df.withColumn(\r\n\"prompt\",\r\ncreate_prompt_udf(\r\nF.col(\"question\"),\r\nF.col(\"topic\"),\r\nF.col(\"category\")\r\n)\r\n).withColumn(\r\n\"answer\",\r\nF.expr(\"\"\"\r\nai_query(\r\n'databricks-meta-llama-3-1-70b-instruct',\r\nprompt\r\n)\r\n\"\"\")\r\n)\nError message\norg.apache.spark.SparkException: [INTERNAL_ERROR] Expected udfs have the same evalType but got different evalTypes: 100,400 SQLSTATE: XX000\nCause\nLegacy Spark UDFs and\nai_query()\nUDFs are processed differently, which creates the error.\nLegacy Spark UDFs have\nevalTypes: 100\n, which means data is processed row by row. 
The\nai_query()\nUDF has\nevalTypes: 400\nwhich is a Python function that triggers a Python runner internally and uses batch processing.\nSolution\nUse Unity Catalog (UC) UDFs instead of legacy Spark UDFs. UC UDFs are designed to be compatible with Spark functions like\nai_query()\n. UC UDFs can handle the same batch processing method as the\nai_query()\nUDF, ensuring that there is no conflict in the evaluation types.\nFor more information, refer to the\nUser-defined functions (UDFs) in Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/sparkr-gapply.json b/scraped_kb_articles/sparkr-gapply.json new file mode 100644 index 0000000000000000000000000000000000000000..8e29a6ea29c195bfbaa4955333417747d8022f5f --- /dev/null +++ b/scraped_kb_articles/sparkr-gapply.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/sparkr-gapply", + "title": "Título do Artigo Desconhecido", + "content": "Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed computing. In such cases, the SparkR UDF API can be used to distribute the desired workload across a cluster.\nExample use case: You want to train a machine learning model on subsets of a data set, grouped by a key. If the subsets of the data fit on the workers, it may be more efficient to use the SparkR UDF API to train multiple models at once.\nThe\ngapply\nand\ngapplyCollect\nfunctions apply a function to each group in a Spark DataFrame. 
For each group in a Spark DataFrame:\nCollect each group as an R data.frame.\nSend the function to the worker and execute.\nReturn the result to the driver as specified by the schema.\nInfo\nWhen you call\ngapply\n, you must specify the output schema. With\ngapplyCollect\n, the result is collected to the driver using an R data.frame for the output.\nIn the following example, a separate support vector machine model is fit on the\nairquality\ndata for each month. The output is a data.frame with the resulting MSE for each month, shown both with and without specifying the schema.\n%r\r\n\r\ndf <- createDataFrame(na.omit(airquality))\r\n\r\nschema <- structType(\r\n  structField(\"Month\", \"integer\"),\r\n  structField(\"MSE\", \"double\"))\r\n\r\nresult <- gapply(df, c(\"Month\"), function(key, x) {\r\n  library(e1071)\r\n  data.frame(month = key, mse = svm(Ozone ~ ., x, cross = 3)$tot.MSE)\r\n}, schema)\n%r\r\n\r\ndf <- createDataFrame(na.omit(airquality))\r\n\r\ngapplyCollect(df, c(\"Month\"), function(key, x) {\r\n  library(e1071)\r\n  y <- data.frame(month = key, mse = svm(Ozone ~ ., x, cross = 3)$tot.MSE)\r\n  names(y) <- c(\"Month\", \"MSE\")\r\n  y\r\n})\nInfo\nStart with a Spark DataFrame and install packages on all workers." +} \ No newline at end of file diff --git a/scraped_kb_articles/sparkr-lapply.json b/scraped_kb_articles/sparkr-lapply.json new file mode 100644 index 0000000000000000000000000000000000000000..22f054929326609035c3081d040b6700fbf83bb6 --- /dev/null +++ b/scraped_kb_articles/sparkr-lapply.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/sparkr-lapply", + "title": "Título do Artigo Desconhecido", + "content": "Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. 
In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed computing. In such cases, the SparkR UDF API can be used to distribute the desired workload across a cluster.\nExample use case: You want to train multiple machine learning models on the same data, for example for hyperparameter tuning. If the data set fits on each worker, it may be more efficient to use the SparkR UDF API to train several versions of the model at once.\nThe\nspark.lapply\nfunction enables you to perform the same task on multiple workers, by running a function over a list of elements. For each element in a list:\nSend the function to a worker.\nExecute the function.\nReturn the result of all workers as a list to the driver.\nIn the following example, a support vector machine model is fit on the\niris\ndataset with 3-fold cross validation while the cost is varied from 0.5 to 1 by increments of 0.1. The output is a list with the summary of the models for the various cost parameters.\n%r\r\n\r\nlibrary(SparkR)\r\n\r\nspark.lapply(seq(0.5, 1, by = 0.1), function(x) {\r\n  library(e1071)\r\n  model <- svm(Species ~ ., iris, cost = x, cross = 3)\r\n  summary(model)\r\n})\nInfo\nYou must install packages on all workers." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/special-characters-appearing-as-junk-characters-when-using-the-simba-spark-odbc-driver-to-access-databricks-tables-from-external-applications.json b/scraped_kb_articles/special-characters-appearing-as-junk-characters-when-using-the-simba-spark-odbc-driver-to-access-databricks-tables-from-external-applications.json new file mode 100644 index 0000000000000000000000000000000000000000..d929cee31f04a7e4e580b69f1e2f2e6796c1944a --- /dev/null +++ b/scraped_kb_articles/special-characters-appearing-as-junk-characters-when-using-the-simba-spark-odbc-driver-to-access-databricks-tables-from-external-applications.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/special-characters-appearing-as-junk-characters-when-using-the-simba-spark-odbc-driver-to-access-databricks-tables-from-external-applications", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the Simba Spark ODBC Driver to access Databricks tables containing special characters (such as Swedish characters Å, Ä, and Ö) from external applications like SAS, the special characters do not display correctly in the external application. Instead, special characters appear as junk characters.\nCause\nThe encoding used in Databricks, UTF-8, differs from the encoding used in the external application, ISO 8859-15/latin9. The Simba Spark ODBC Driver does not automatically transcode the characters from UTF-8 to ISO 8859-15/latin9, resulting in junk characters appearing in the external application.\nSolution\nSet the environment variable\nSIMBA_APP_ANSI_ENCODING\nto\nISO-8859-15\nbefore starting the Simba Spark ODBC Driver. This variable instructs the driver to transcode the characters from UTF-8 to ISO 8859-15/latin9, ensuring that the special characters display correctly in the external application.\nFollow these steps to set the environment variable and configure the Simba Spark ODBC Driver.\n1. 
Open the ODBC Data Source Administrator (64-bit) on the machine where the Simba Spark ODBC Driver is installed.\n2. Navigate to the\nSystem DSN\ntab and select the Databricks DSN.\n3. Click\nConfigure\nto open the Simba Spark ODBC Driver configuration window.\n4. Go to the\nAdvanced options\ntab.\n5. In the\nEnvironment variables\nsection, add a new variable with the name\nSIMBA_APP_ANSI_ENCODING\nand set its value to\nISO-8859-15\n.\n6. Click\nOK\nto save the changes and close the configuration window.\n7. Test the connection to confirm that the special characters now display correctly in the external application." +} \ No newline at end of file diff --git a/scraped_kb_articles/special-characters-in-xml.json b/scraped_kb_articles/special-characters-in-xml.json new file mode 100644 index 0000000000000000000000000000000000000000..14da0b0f8a08a49eb587826c17ef4e378be7dd10 --- /dev/null +++ b/scraped_kb_articles/special-characters-in-xml.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/special-characters-in-xml", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have special characters in your source files and are using the OSS library\nSpark-XML\n.\nThe special characters do not render correctly.\nFor example, “CLU®” is rendered as “CLU�”.\nCause\nSpark-XML supports the UTF-8 character set by default. You are using a different character set in your XML files.\nSolution\nYou must specify the character set you are using in your XML files when reading the data.\nUse the\ncharset\noption to define the character set when reading an XML file with Spark-XML.\nFor example, if your source file is using ISO-8859-1:\n%python\r\n\r\ndfResult = spark.read.format('xml').schema(customSchema) \\\r\n.options(rowTag='Entity') \\\r\n.options(charset='ISO-8859-1')\\\r\n.load('//.xml')\nReview the\nSpark-XML README\nfile for more information on supported options." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/speed-up-cross-validation.json b/scraped_kb_articles/speed-up-cross-validation.json new file mode 100644 index 0000000000000000000000000000000000000000..5cab3d9cc0249e2e6873042970277cf25b1cce9d --- /dev/null +++ b/scraped_kb_articles/speed-up-cross-validation.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/speed-up-cross-validation", + "title": "Título do Artigo Desconhecido", + "content": "Hyperparameter tuning of Apache SparkML models takes a very long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up:\nCache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Remember to call an action on the\nDataFrame\nfor the cache to take effect.\nIncrease the parallelism parameter inside the\nCrossValidator\n, which sets the number of threads to use when running parallel algorithms. The default setting is 1. See the CrossValidator documentation for more information.\nDon’t use the pipeline as the estimator inside the\nCrossValidator\nspecification. In some cases where the featurizers are being tuned along with the model, running the whole pipeline inside the\nCrossValidator\nmakes sense. However, this executes the entire pipeline for every parameter combination and fold. Therefore, if only the model is being tuned, set the model specification as the estimator inside the\nCrossValidator\n.\nInfo\nCrossValidator\ncan be set as the final stage inside the pipeline after the featurizers. The best model identified by the\nCrossValidator\nis output." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/sql-access-control-error-when-using-snowflake-as-a-data-source.json b/scraped_kb_articles/sql-access-control-error-when-using-snowflake-as-a-data-source.json new file mode 100644 index 0000000000000000000000000000000000000000..08a8010e2c46c1dd35e1ea6c880c0627e7a104b2 --- /dev/null +++ b/scraped_kb_articles/sql-access-control-error-when-using-snowflake-as-a-data-source.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/sql-access-control-error-when-using-snowflake-as-a-data-source", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nThe\nSnowflake Connector for Spark\nis used to read data from, and write data to, Snowflake while working in Databricks. The connector makes Snowflake look like another Spark data source.\nWhen you try to query Snowflake, you get a\nSnowflakeSQLException\nerror message.\nSnowflakeSQLException: SQL access control error: Insufficient privileges to operate on schema '' \r\n2Insufficient privileges to operate on schema '' at net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowExceptionSub(SnowflakeUtil.java:127)\nCause\nWhen you attempted to\nread and write data from Snowflake\n(\nAWS\n|\nAzure\n|\nGCP\n) you used\nschema\ninstead of\nsfschema\n.\n%python\r\n\r\nsnowflake_table = (spark.read\r\n  .format(\"snowflake\")\r\n  .option(\"dbtable\", )\r\n  .option(\"sfUrl\", )\r\n  .option(\"sfUser\", )\r\n  .option(\"sfPassword\", )\r\n  .option(\"sfDatabase\", )\r\n  .option(\"Schema\", )\r\n  .option(\"sfWarehouse\", )\r\n  .load()\r\n)\nSnowflake does not officially support\nschema\nas an option.\nIn some cases,\nschema\nis treated as\nsfschema\n, but there is no guarantee that this will happen.\nSolution\nWhen reading or writing data from Snowflake you must use\nsfschema\ninstead of\nschema\nin Snowflake options.\n%python\r\n\r\nsnowflake_table = (spark.read\r\n .format(\"snowflake\")\r\n .option(\"dbtable\", )\r\n 
.option(\"sfUrl\", )\r\n .option(\"sfUser\", )\r\n .option(\"sfPassword\", )\r\n .option(\"sfDatabase\", )\r\n .option(\"sfSchema\", )\r\n .option(\"sfWarehouse\", )\r\n .load()\r\n)\nPlease review the Snowflake\nUsing the Spark Connector\ndocumentation for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/sql-compilation-error-when-trying-to-access-a-snowflake-defined-view.json b/scraped_kb_articles/sql-compilation-error-when-trying-to-access-a-snowflake-defined-view.json new file mode 100644 index 0000000000000000000000000000000000000000..6faa22f83560e2593233a8307fcee117094d287a --- /dev/null +++ b/scraped_kb_articles/sql-compilation-error-when-trying-to-access-a-snowflake-defined-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/sql-compilation-error-when-trying-to-access-a-snowflake-defined-view", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re working in Lakehouse Federation and want to read data from Snowflake. You try to access a table directly from Catalog Explorer and encounter the following error message.\nYour request failed with status FAILED: [BAD_REQUEST] SQL compilation error: View definition for '..' declared X column(s), but view query produces column(s).\nThis issue arises while reading the\n..\ntable. This table in Databricks corresponds to a view defined in Snowflake.\nCause\nThere is a discrepancy between the columns defined in the view and the columns produced by the underlying query in Snowflake.\nThe view\n..\nin Snowflake declares X columns, but the underlying query for this view now produces greater than X columns, leading to a mismatch and subsequent SQL compilation error.\nSolution\nConfirm view definition and column discrepancy\nFirst, ensure that\n..\nis indeed a view in Snowflake and not a regular table in Databricks.\nRun the following commands in Snowflake to validate the view definition and describe the structure of the view. 
These commands help you confirm the declared columns in the view against those produced by the underlying query.\nSHOW CREATE VIEW ..;\r\nDESCRIBE EXTENDED ..;\nModify the view in Snowflake\nIf the mismatch is confirmed, update the view definition to ensure the number of columns declared matches the columns produced.\nRefer to the Snowflake\nALTER VIEW\ndocumentation for detailed instructions on how to modify the view.\nSynchronize the view definition\nAfter updating the view definition, re-run the\nREFRESH FOREIGN CATALOG\ncommand in Databricks. This step ensures that the view definition and the produced query columns in Snowflake are synchronized, resolving the SQL compilation error." +} \ No newline at end of file diff --git a/scraped_kb_articles/sql-in-python.json b/scraped_kb_articles/sql-in-python.json new file mode 100644 index 0000000000000000000000000000000000000000..f1990e03151667657b1d21acedb949a325b51d8e --- /dev/null +++ b/scraped_kb_articles/sql-in-python.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/sql-in-python", + "title": "Título do Artigo Desconhecido", + "content": "You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (\nAWS\n|\nAzure\n), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using\nPyHive\nand then run a script. You should have PyHive installed on the machine where you are running the Python script.\nInfo\nPython 2 is considered\nend-of-life\n. You should use Python 3 to run the script provided in this article. If you have both Python 2 and Python 3 running on your system, you should make sure your version of pip is linked to Python 3 before you proceed.\nYou can check your version of pip by running\npip -V\nat the command prompt. 
This command returns the version of pip and the version of Python it is using.\nInstall PyHive and Thrift\nUse pip to install PyHive and Thrift.\n%sh\r\n\r\npip install pyhive thrift\nRun SQL script\nThis sample Python script sends the SQL query\nshow tables\nto your cluster and then displays the result of the query.\nDo the following before you run the script:\nReplace\n\nwith your Databricks API token.\nReplace\n\nwith the domain name of your Databricks deployment.\nReplace\n\nwith the Workspace ID.\nReplace\n\nwith a cluster ID.\nTo get the API token, see Generate a token (\nAWS\n|\nAzure\n). To determine the other values, see How to get Workspace, Cluster, Notebook, and Job Details (\nAWS\n|\nAzure\n).\n%python\r\n\r\n#!/usr/bin/python\r\n\r\n\r\nimport os\r\nimport sys\r\nfrom pyhive import hive\r\nfrom thrift.transport import THttpClient\r\nimport base64\r\n\r\n\r\nTOKEN = \"\"\r\nWORKSPACE_URL = \"\"\r\nWORKSPACE_ID = \"\"\r\nCLUSTER_ID = \"\"\r\n\r\n\r\nconn = 'https://%s/sql/protocolv1/o/%s/%s' % (WORKSPACE_URL, WORKSPACE_ID, CLUSTER_ID)\r\nprint(conn)\r\n\r\n\r\ntransport = THttpClient.THttpClient(conn)\r\n\r\n\r\nauth = \"token:%s\" % TOKEN\r\nPY_MAJOR = sys.version_info[0]\r\n\r\n\r\nif PY_MAJOR < 3:\r\n  auth = base64.standard_b64encode(auth)\r\nelse:\r\n  auth = base64.standard_b64encode(auth.encode()).decode()\r\n\r\n\r\ntransport.setCustomHeaders({\"Authorization\": \"Basic %s\" % auth})\r\n\r\n\r\ncursor = hive.connect(thrift_transport=transport).cursor()\r\n\r\n\r\ncursor.execute('show tables',async_=True)\r\n\r\n\r\npending_states = (\r\n        hive.ttypes.TOperationState.INITIALIZED_STATE,\r\n        hive.ttypes.TOperationState.PENDING_STATE,\r\n        hive.ttypes.TOperationState.RUNNING_STATE)\r\n\r\n\r\nwhile cursor.poll().operationState in pending_states:\r\n    print(\"Pending...\")\r\n\r\n\r\nprint(\"Done. 
Results:\")\r\n\r\n\r\nfor table in cursor.fetchall():\r\n    print(table)" +} \ No newline at end of file diff --git a/scraped_kb_articles/sql-query-on-bigquery-table-fails-with-classcastexception-error.json b/scraped_kb_articles/sql-query-on-bigquery-table-fails-with-classcastexception-error.json new file mode 100644 index 0000000000000000000000000000000000000000..3b55b63bb394641d43bf9b58ca8be9e7027dd252 --- /dev/null +++ b/scraped_kb_articles/sql-query-on-bigquery-table-fails-with-classcastexception-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/sql-query-on-bigquery-table-fails-with-classcastexception-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen querying a BigQuery table from a Databricks cluster running on Databricks Runtime versions 16.2 and above, the query fails with a\nClassCastException\nerror.\nExample query\nSELECT * FROM industry.employees LIMIT 10\nExample error\nClassCastException: class org.apache.spark.sql.types.StringType$ cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.StringType$ and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')\nCause\nYou have a table and a column with the same name.\nWhen a SQL query references a table and a column with the same name without being explicit, it leads to ambiguity in BigQuery. This is a known issue with the BigQuery storage adapter.\nFor more detail and context, refer to the Google forum topic\nIssue create view when the source table and column have the same name\n.\nSolution\nThe issue is not on the Databricks side. Databricks ships an OSS connector built by Google.\nGoogle suggests two options to fix this issue. The example code lines continue the hardcoded example from the problem to demonstrate the changes. 
Change\nemployees\nand\nindustry.employees\nto reflect your respective column and table names.\nThe first option is to rename the column with an alias in the query to avoid confusion.\nSELECT employees AS emp FROM industry.employees\nThe second option is to fully specify both the table name and the column name.\nSELECT industry.employees.employees FROM industry.employees\nIf the issue persists despite the suggested steps, contact Google Support through their forums." +} \ No newline at end of file diff --git a/scraped_kb_articles/sql-transformations-involving-timestamp-columns-giving-different-results-in-an-interactive-cluster-versus-serverless-compute.json b/scraped_kb_articles/sql-transformations-involving-timestamp-columns-giving-different-results-in-an-interactive-cluster-versus-serverless-compute.json new file mode 100644 index 0000000000000000000000000000000000000000..57869af828d92c34e194d31ab0c477b715595f4c --- /dev/null +++ b/scraped_kb_articles/sql-transformations-involving-timestamp-columns-giving-different-results-in-an-interactive-cluster-versus-serverless-compute.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/sql-transformations-involving-timestamp-columns-giving-different-results-in-an-interactive-cluster-versus-serverless-compute", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen performing SQL transformations involving timestamp columns, you get unexpectedly different results in an interactive cluster versus serverless compute.\nCause\nYour interactive cluster is running on a version of Databricks Runtime preconfigured to use Java Development Kit (JDK) 8. 
Serverless compute runs on Databricks Runtime 16.0, which is preconfigured with JDK 17.\nJava Development Kit (JDK) changed the precision of timestamp values starting in JDK 9, refining it from milliseconds to microseconds.\nContext\nOpenJDK introduced an enhancement,\nIncrease the precision of the implementation of java.time.Clock.systemUTC()\nstarting in Java 9. This enhancement allows for capturing time to the microsecond (\nyyyy-MM-dd HH:mm:ss.SSSSSS\n). In Java 8, capturing the current moment is still limited to only milliseconds (\nyyyy-MM-dd HH:mm:ss.SSS\n).\nSolution\nUse SQL’s type casting to handle precision, or upgrade your JDK version.\nUse SQL’s type casting\nUse the\ndate_format(\"timestamp\", \"yyyy-MM-dd HH:mm:ss.SSS\")\nfunction to keep the precision level at milliseconds, allowing the new data to match the preexisting data.\nExamples\nThis code snippet…\nchanges to…\nSELECT * FROM Target tgt INNER JOIN Source src\nON\ntgt.etl_time = src.etl_time\nSELECT * FROM Target tgt INNER JOIN Source src\nON\ndate_format(tgt.etl_time, \"yyyy-MM-dd HH:mm:ss.SSS\") = date_format(src.etl_time, \"yyyy-MM-dd HH:mm:ss.SSS\")\nSELECT max(etl_time) FROM Target\nSELECT max(date_format(etl_time, \"yyyy-MM-dd HH:mm:ss.SSS\")) FROM Target\nUpgrade your JDK version\nRefer to the\nDatabricks SDK for Java\n(\nAWS\n|\nAzure\n|\nGCP\n) for instructions on creating a cluster with JDK 17.\nNote\nFor Databricks Runtime versions 13.1 to 15.4, JDK 8 is the default, and JDK 17 is in Public Preview." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/sql-warehouse-launch-fails-to-start-with-permission_denied.json b/scraped_kb_articles/sql-warehouse-launch-fails-to-start-with-permission_denied.json new file mode 100644 index 0000000000000000000000000000000000000000..11963d4ce66eac4cbbd7ba7a66dc0b563d97c348 --- /dev/null +++ b/scraped_kb_articles/sql-warehouse-launch-fails-to-start-with-permission_denied.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/sql-warehouse-launch-fails-to-start-with-permission_denied", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you attempt to start an SQL warehouse, you receive the following error.\n\"Clusters are failing to launch. Cluster launch will be retried. Request to create a cluster failed with an exception: PERMISSION_DENIED: You are not authorized to create clusters. Please contact your administrator.\"\nCause\nThe issue occurs when the SQL warehouse owner does not have workspace admin privileges or permissions for unrestricted cluster creation.\nSolution\nVerify the current owner of the SQL warehouse.\nHave your workspace admin assign one of the following permissions to the warehouse owner:\nUnrestricted cluster creation permissions. For instructions, review the “Compute entitlements” section of the\nManage entitlements\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nWorkspace admin privileges to the SQL warehouse owner. For instructions, review the “Assign the workspace admin role to a user” section of the\nManage users\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf the warehouse owner’s permissions change in the future, you may need to follow these steps again.\nFor more information on launching a SQL warehouse, refer to the\nCreate a SQL warehouse\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/sql-warehouse-will-not-launch-after-upgrading-to-serverless.json b/scraped_kb_articles/sql-warehouse-will-not-launch-after-upgrading-to-serverless.json new file mode 100644 index 0000000000000000000000000000000000000000..ee84568b409646ad0050d10a9547888e5e2e4da0 --- /dev/null +++ b/scraped_kb_articles/sql-warehouse-will-not-launch-after-upgrading-to-serverless.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/sql-warehouse-will-not-launch-after-upgrading-to-serverless", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to start a SQL warehouse after upgrading it from pro to serverless, you encounter an error that prevents the warehouse from launching.\n\"Clusters are failing to launch. Cluster launch will be retried. Request to create a cluster failed with an exception: Serverless warehouse cannot start since warehouse `` has runtime version overrides. Please remove the overrides to continue using the warehouse.\"\nCause\nYou’re using a custom Databricks runtime version with your SQL warehouse. When you change the warehouse from pro to serverless, any existing runtime version overrides become incompatible with the new serverless configuration.\nServerless SQL warehouses do not support runtime version overrides, resulting in the error.\nSolution\nImportant\nYou may want to set\ntest_overrides\nto\nNULL\nto solve this problem.\nNULL\noverrides all custom Apache Spark configurations for the warehouse. The following solution shows you how to indicate specific Spark configurations using key-value pairs.\nIn your SQL warehouse configuration, indicate the runtime version overrides to remove as a key-value pair within the\nspark_conf\nfield. 
The following code provides an example.\n{\r\n    \"confs\": {\r\n        \"test_overrides\": {\r\n            \"cluster_attributes\": {\r\n                \"spark_conf\": {\r\n                    key: value\r\n                }\r\n            }\r\n        }\r\n    }\r\n}\nUse the API to execute a POST call with this updated configuration. Refer to the\nUpdate a warehouse\n(\nAWS\n|\nAzure\n|\nGCP\n) API documentation for details.\nAfter updating the configuration, restart the SQL warehouse. The warehouse automatically takes the default Databricks Runtime upon restarting." +} \ No newline at end of file diff --git a/scraped_kb_articles/ss-read-from-last-offset.json b/scraped_kb_articles/ss-read-from-last-offset.json new file mode 100644 index 0000000000000000000000000000000000000000..6ce3d840999bade4ea0980938675849cad9e83c6 --- /dev/null +++ b/scraped_kb_articles/ss-read-from-last-offset.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/ss-read-from-last-offset", + "title": "Título do Artigo Desconhecido", + "content": "Scenario\nYou have a stream, running a windowed aggregation query, that reads from Apache Kafka and writes files in\nAppend\nmode. You want to upgrade the application and restart the query with the offset equal to the last written offset. You want to discard all state information that hasn’t been written to the sink, start processing from the earliest offsets that contributed to the discarded state, and modify the checkpoint directory accordingly.\nHowever, if you use existing checkpoints after upgrading the application code, old states and objects from the previous application version are re-used, which results in unexpected output such as reading from old sources or processing with old application code.\nSolution\nApache Spark maintains state across the execution and binary objects on checkpoints. Therefore you cannot modify the checkpoint directory. 
As an alternative, copy and update the offset with the input records and store this in a file or a database. Read it during the initialization of the next restart and use the same value in\nreadStream\n. Make sure to delete the checkpoint directory.\nYou can get the current offsets by using asynchronous APIs:\n%scala\r\n\r\nspark.streams.addListener(new StreamingQueryListener() {\r\n    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {\r\n        println(\"Query started:\" + queryStarted.id)\r\n    }\r\n    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {\r\n        println(\"Query terminated\" + queryTerminated.id)\r\n    }\r\n    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {\r\n     println(\"Query made progress\")\r\n        println(\"Starting offset:\" + queryProgress.progress.sources(0).startOffset)\r\n        println(\"Ending offset:\" + queryProgress.progress.sources(0).endOffset)\r\n        //Logic to save these offsets\r\n    }\r\n})\nYou can use\nreadStream\nwith the latest offset written by the process shown above:\n%scala\r\n\r\noption(\"startingOffsets\",  \"\"\" {\"articleA\":{\"0\":23,\"1\":-1},\"articleB\":{\"0\":-2}} \"\"\")\nThe input schema for streaming records is:\nroot\r\n|-- key: binary (nullable = true)\r\n|-- value: binary (nullable = true)\r\n|-- article: string (nullable = true)\r\n|-- partition: integer (nullable = true)\r\n|-- offset: long (nullable = true)\r\n|-- timestamp: timestamp (nullable = true)\r\n|-- timestampType: integer (nullable = true)\nAlso, you can implement logic to save and update the offset to a database and read it at the next restart." 
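The save-and-restore idea above can be sketched outside Spark in plain Python. This is a minimal sketch under stated assumptions: the file name is hypothetical (a database row works equally well), and the offsets use the Kafka source layout `{topic: {partition: offset}}` shown in the article's `startingOffsets` example.

```python
import json
import os

# Hypothetical location for the persisted offsets; not from the article.
OFFSET_FILE = "last_offsets.json"

def save_offsets(end_offset: dict) -> None:
    """Persist the endOffset captured in the onQueryProgress callback."""
    with open(OFFSET_FILE, "w") as f:
        json.dump(end_offset, f)

def load_starting_offsets(default: str = "earliest") -> str:
    """Build the string to pass to option("startingOffsets", ...) on restart."""
    if not os.path.exists(OFFSET_FILE):
        return default  # first run: no saved progress yet
    with open(OFFSET_FILE) as f:
        return json.dumps(json.load(f))

# Offsets in the Kafka source layout: {topic: {partition: offset}}
save_offsets({"articleA": {"0": 23, "1": -1}, "articleB": {"0": -2}})
print(load_starting_offsets())
```

As the article notes, the old checkpoint directory must still be deleted before restarting with the saved offsets.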
+} \ No newline at end of file diff --git a/scraped_kb_articles/ssl-error-when-invoking-databricks-model-serving-endpoint.json b/scraped_kb_articles/ssl-error-when-invoking-databricks-model-serving-endpoint.json new file mode 100644 index 0000000000000000000000000000000000000000..d319e303bd6e3b1cbf77e8e224ad6b7422a2dfed --- /dev/null +++ b/scraped_kb_articles/ssl-error-when-invoking-databricks-model-serving-endpoint.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/ssl-error-when-invoking-databricks-model-serving-endpoint", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen querying a Databricks model serving endpoint, you encounter an error indicating the TLS connection was unexpectedly terminated during the HTTPS request to the model serving endpoint.\nSSL Error: HTTPSConnectionPool(host='.cloud.databricks.com', port=443): Max retries exceeded with url: /serving-endpoints/gpt2-endpoint/invocations (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2406)')))\nCause\nThe request payload exceeds the maximum size limits enforced by the Databricks model serving infrastructure.\nCurrently, the payload size limit is 16 MB per request for custom models. For endpoints serving foundation models, external models, or AI agents, the limit is 4 MB per request.\nPayloads exceeding these thresholds may result in backend rejection or abrupt TLS termination, leading to SSL-related errors.\nSolution\nReduce input payload size\nEnsure your serialized JSON input (inputs, params, and so on) is below 16 MB, or 4 MB for foundation model endpoints.\nSplit large inputs, such as large documents or long token sequences, into smaller parts if needed.\nCheck payload size before sending\nTo avoid request failures, consider adding a pre-check in your client code to verify that the payload does not exceed the limit. 
You can raise a\nValueError\nif the payload size surpasses the allowed threshold, as shown in the following example.\ndata = {\r\n   \"inputs\": [large_payload],\r\n   \"params\": {\"max_new_tokens\": 10, \"temperature\": 1}\r\n}\r\nencoded_payload = json.dumps(data).encode(\"utf-8\")\r\nif len(encoded_payload) > 16_000_000:\r\n   raise ValueError(\"Payload exceeds 16MB limit. Reduce input size.\")\nFor additional information, review the\nModel Serving limits and regions\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/ssl_connect-certificate-verify-failed-error-when-trying-to-connect-to-databricks-from-tableau.json b/scraped_kb_articles/ssl_connect-certificate-verify-failed-error-when-trying-to-connect-to-databricks-from-tableau.json new file mode 100644 index 0000000000000000000000000000000000000000..68594596cada6bdaf90e483ab6388662efc1e747 --- /dev/null +++ b/scraped_kb_articles/ssl_connect-certificate-verify-failed-error-when-trying-to-connect-to-databricks-from-tableau.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/bi/ssl_connect-certificate-verify-failed-error-when-trying-to-connect-to-databricks-from-tableau", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to connect to Databricks from Tableau, you receive an error.\nError: Unexpected response from server during an HTTP connection: SSL_connect: certificate verify failed.\r\nError code: B19090E0\nThe following image shows an example of the error in the UI.\nCause\nIntermediate certificates from sources such as corporate root certificates, VPNs, proxy servers, or firewalls are not automatically picked up.\nSolution\nManually add intermediate certificates to the root certificates file in order to be recognized by the ODBC driver.\nImportant\nBefore proceeding, confirm with your organization’s infrastructure team whether the certificates belong to your infrastructure. 
They may relate to firewalls, VPNs, proxy servers, and so on.\nOn the machine where you are running Tableau Desktop, open the Databricks workspace in a new tab and locate the certificate chain. Export all intermediate certificates and append content from all the generated files to the\ncacerts.pem\nfile.\nThe following image is an example of the certificate viewer in the UI.\nAlternatively, you can run the following command to export the certificates and append the content.\nopenssl s_client -showcerts -connect :443 < /dev/null | openssl x509 -outform PEM > databricks_cert_chain.pem\nThe generated\ndatabricks_cert_chain.pem\nfile will now contain the intermediate certificates.\nOnce the certificates are verified, copy the contents of\ndatabricks_cert_chain.pem\nto the ODBC driver\ncacerts.pem\nfile. The following list shows the paths to the ODBC driver\ncacerts.pem\nfile in Linux, MacOS, and Windows.\nLinux -\n/opt/simba/spark/lib/cacerts.pem\nMacOS -\n/Library/simba/spark/lib/cacerts.pem\nWindows -\nC:\\Program Files\\Simba Spark ODBC Driver\\lib\\cacerts.pem\nIf the issue still persists, set\nAllowDetailedSSLErrorMessages=1\nand\nEnableCurlDebugLogging=1\nin the\nmicrosoft.sparkodbc.ini\nfile to obtain more detailed SSL debugging logs." 
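The final copy-into-the-trust-store step can also be scripted. The helper below is a minimal sketch: the function name and the backup step are assumptions, not part of the article; the real `cacerts.pem` locations are the OS-specific paths listed above.

```python
import shutil

def append_cert_chain(chain_path: str, cacerts_path: str) -> None:
    """Back up the ODBC driver trust store, then append the exported chain."""
    # Keep a restore point before modifying the driver's trust store.
    shutil.copyfile(cacerts_path, cacerts_path + ".bak")
    with open(chain_path) as chain, open(cacerts_path, "a") as trust_store:
        trust_store.write("\n" + chain.read())

# Example (Linux driver path from the list above):
# append_cert_chain("databricks_cert_chain.pem", "/opt/simba/spark/lib/cacerts.pem")
```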
+} \ No newline at end of file diff --git a/scraped_kb_articles/sso-saml-failure-when-authenticating-in-databricks-using-active-directory-federation-services-ad-fs.json b/scraped_kb_articles/sso-saml-failure-when-authenticating-in-databricks-using-active-directory-federation-services-ad-fs.json new file mode 100644 index 0000000000000000000000000000000000000000..a441a260686f1940f75f9c0b24c7e4cf3a5f8c69 --- /dev/null +++ b/scraped_kb_articles/sso-saml-failure-when-authenticating-in-databricks-using-active-directory-federation-services-ad-fs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/sso-saml-failure-when-authenticating-in-databricks-using-active-directory-federation-services-ad-fs", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou experience single sign-on (SSO) failures when authenticating in Databricks through Active Directory Federation Services (AD FS). The following error message displays in the browser after login.\n{ “message” : \"The service at / is temporarily unavailable. Please try again later [TraceId: -]\", \"error_code\": \"TEMPORARILY_UNAVAILABLE\"}\nCause\nAD FS has sent a malformed SAML response, with an improperly formatted\nemailaddress\nattribute.\nThe\nemailaddress\nattribute in the XML format contains a newline character (\n\\n\n) at the end of the email value. In the SAML XML specification, a newline character is considered a whitespace character and can break the schema, leading to incorrect attribute processing during authentication.\nSolution\n1. Verify your email address formatting in AD FS.\n2. Remove any trailing newline or whitespace characters from the\nemailaddress\nattribute value.\n3. Even if the\nemailaddress\nattribute appears correct and does not contain any visible spaces or newline characters, remove and re-add yourself  to AD FS. 
This ensures any cached or incorrect metadata associated with your credential is cleared and updated.\nThe following code shows the correctly formatted\nemailaddress\nattribute without any line breaks after the email address, before the closing\nAttributeValue\ntag.\n\r\n  \r\n    first.last@domain.com\r\n  \r\n\n4. Validate the SAML response. Open your browser developer tools while initiating the SSO as shown below. Copy the SAMLResponse payload and decode it.\nAlternatively, if you have access to a bash shell, you can decode it locally using the following command.\n$ echo \"\" |  tr -d '\\n' | base64 -d | xmllint --format -\n5. Confirm that the emailaddress attribute value is correctly formatted without any newline characters.\n6. Reattempt authentication to ensure the issue is resolved." +} \ No newline at end of file diff --git a/scraped_kb_articles/standard-formerly-shared-cluster-not-allowing-use-of-machine-learning-runtime.json b/scraped_kb_articles/standard-formerly-shared-cluster-not-allowing-use-of-machine-learning-runtime.json new file mode 100644 index 0000000000000000000000000000000000000000..5f75cb24fec84f1cbd71455a8d4ccfe71e122f14 --- /dev/null +++ b/scraped_kb_articles/standard-formerly-shared-cluster-not-allowing-use-of-machine-learning-runtime.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/standard-formerly-shared-cluster-not-allowing-use-of-machine-learning-runtime", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to create a standard (formerly shared) compute with Machine Learning Runtime, you get the following message.\n“Databricks Runtime for Machine Learning does not support User Isolation security mode. 
Use Single User security mode if you need to access Unity Catalog.”\nCause\nDatabricks Runtime ML and Spark Machine Learning Library (MLlib) are not supported in standard compute.\nSolution\nUse a dedicated (formerly single-user) compute attached to a group with the required list of users.\nFor details, refer to the\nAssign compute resources to a group\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/stateful-streaming-query-failing-with-sparksecurityexception-error-after-restarting-on-a-shared-cluster-or-serverless.json b/scraped_kb_articles/stateful-streaming-query-failing-with-sparksecurityexception-error-after-restarting-on-a-shared-cluster-or-serverless.json new file mode 100644 index 0000000000000000000000000000000000000000..6b817df04d157b1479e2760bf46240d5941a95a4 --- /dev/null +++ b/scraped_kb_articles/stateful-streaming-query-failing-with-sparksecurityexception-error-after-restarting-on-a-shared-cluster-or-serverless.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/stateful-streaming-query-failing-with-sparksecurityexception-error-after-restarting-on-a-shared-cluster-or-serverless", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou have a stateful streaming query with\ndropDuplicates()\nor\ndropDuplicatesWithinWatermark()\nspecified, which was previously running on a single user cluster or a No Isolation Shared cluster. When you try to restart the query on a shared cluster or serverless, the query fails. 
You receive the following error (the exact details depend on your context).\norg.apache.spark.SparkSecurityException: Could not verify permissions for logical node ~WriteToMicroBatchDataSourceV1 ForeachBatchSink, XXXX, [checkpointLocation=s3://XXXX], Append, 0\nWhen you check the error stack trace, you see the following message.\nCaused by: org.apache.spark.sql.execution.streaming.state.InvalidUnsafeRowException: The streaming query failed by state format invalidation. The following reasons may cause this: 1. An old Spark version wrote the checkpoint that is incompatible with the current one; 2. Broken checkpoint files; 3. The query is changed among restart. For the first case, you can try to restart the application without checkpoint or use the legacy Spark version to process the streaming state.Error message is: Variable-length field validation error: field: XXXX,XXXX\nCause\nThe cluster mode change causes the issue when using Spark Connect in Databricks Runtime 15.4 LTS and above. There is a known Spark Connect limitation on switching cluster modes on Stateful Streaming queries. For more information, review\nSPARK-49722\n.\nSolution\nWhen you restart Stateful Streaming queries, use the same cluster access mode." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/stateful-structured-streaming-jobs-fail-after-making-changes-to-stateful-operations.json b/scraped_kb_articles/stateful-structured-streaming-jobs-fail-after-making-changes-to-stateful-operations.json new file mode 100644 index 0000000000000000000000000000000000000000..5de3102b618abb469f4d785e08b4dac51706235b --- /dev/null +++ b/scraped_kb_articles/stateful-structured-streaming-jobs-fail-after-making-changes-to-stateful-operations.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/stateful-structured-streaming-jobs-fail-after-making-changes-to-stateful-operations", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nAfter making changes to a stateful Structured Streaming operation, you encounter a failure.\nExample\nThe exact error message displayed depends on your state store provider. The following example demonstrates using\ndropDuplicates()\non a streaming DataFrame with a watermark in RocksDB.\n# original code:\r\nstreaming_df = streaming_df.dropDuplicates(subset=[\"column1\"])\r\n\r\n# later changed to:\r\nstreaming_df = streaming_df.dropDuplicates(subset=[\"column1\", \"column2\"])\norg.apache.spark.sql.execution.streaming.state.StateStoreKeySchemaNotCompatible: [STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE] Provided key schema does not match existing state key schema.\r\nPlease check number and type of fields.\r\nExisting key_schema=StructType(StructField(column1,IntegerType,true)) and new key_schema=StructType(StructField(column1,IntegerType,true),StructField(column2,TimestampType,true)).\nCause\nStateful operations in streaming queries require maintaining state data to continuously update results.\nStructured Streaming automatically checkpoints this state data to fault-tolerant storage like HDFS, AWS S3, or Azure Blob storage and restores it after restart. 
However, this assumes the state data schema remains unchanged across restarts.\nFor more information on recovery semantics after changes in a streaming query, refer to the Apache Spark\nStructured Streaming Programming Guide\n.\nSolution\nOnly make changes to stateful code if you are willing to create a new checkpoint." +} \ No newline at end of file diff --git a/scraped_kb_articles/status-failure-when-attempting-to-update-a-direct-vector-access-index.json b/scraped_kb_articles/status-failure-when-attempting-to-update-a-direct-vector-access-index.json new file mode 100644 index 0000000000000000000000000000000000000000..306835af8d2e5b9a6cd282aaeb830b9a3583fa00 --- /dev/null +++ b/scraped_kb_articles/status-failure-when-attempting-to-update-a-direct-vector-access-index.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/status-failure-when-attempting-to-update-a-direct-vector-access-index", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to upsert into a Direct Vector Access Index, you may encounter a generic failure message.\n{'status': 'FAILURE', 'result': {'success_row_count': 0, 'failed_primary_keys': ['1', '2']}}.\nCause\nThe length of the embedding field in the data you are trying to upsert does not match the\nembedding_dimension\nparameter defined during the creation of the Direct Vector Access Index.\nExample\nCreate a vector index, defining an\nembedding_dimension\nof 1024.\nclient = VectorSearchClient()\r\nindex = client.create_direct_access_index(\r\n    endpoint_name=\"storage_endpoint\",\r\n    index_name=\"{}.{}.{}\",\r\n    primary_key=\"id\",\r\n    embedding_dimension=1024, # <-------- This parameter\r\n    embedding_vector_column=\"text_vector\",\r\n    schema={\r\n     \"id\": \"int\",\r\n     \"field2\": \"str\",\r\n     \"field3\": \"float\",\r\n     \"text_vector\": \"array\"}\r\n)\nThen, upsert this snippet.\nindex.upsert([{\"id\": 1, \"field2\": \"value2\", \"field3\": 3.0,\r\n  
     \"text_vector\": [1.0, 2.0, 3.0] # <------- The embedding dimension is 4\r\n},\r\n{\"id\": 2, \"field2\": \"value2\", \"field3\": 3.0,\r\n\"text_vector\": [1.1, 2.1, 3.0] # <------- The embedding dimension is again 4.\r\n}])\nThe following error occurs, indicating a failure status. The rest of the error specifies the input IDs that caused the failure.\n{'status': 'FAILURE', 'result': {'success_row_count': 0, 'failed_primary_keys': ['1', '2']}}.\nThe data in the upsert snippet contains embeddings of dimension 4, but the defined\nembedding_dimension\nduring the vector index creation was 1024.\nSolution\nEnsure that the\nembedding_dimension\nparameter used during the creation of the Direct Vector Access Index matches the length of the embedding field in the data you are trying to upsert.\nIf the data's embedding field length is different from the defined\nembedding_dimension\n, recreate the index with the correct\nembedding_dimension\nvalue.\nFor example, if the embedding field length is 3, set\nembedding_dimension\nto 3 during index creation." +} \ No newline at end of file diff --git a/scraped_kb_articles/stop-all-scheduled-jobs.json b/scraped_kb_articles/stop-all-scheduled-jobs.json new file mode 100644 index 0000000000000000000000000000000000000000..09f30d5d2a11eb7f8569f16efe5fef1fce0e33f0 --- /dev/null +++ b/scraped_kb_articles/stop-all-scheduled-jobs.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/stop-all-scheduled-jobs", + "title": "Título do Artigo Desconhecido", + "content": "Under normal conditions, jobs run periodically and auto-terminate once their task is completed. 
In some cases, you may want to stop all scheduled jobs.\nFor more information on scheduled jobs, please review the\nCreate, run, and manage Databricks Jobs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nThis article provides sample code that you can use to stop all of your scheduled jobs.\nInfo\nIf you are a standard user, this will only stop scheduled jobs owned by you.\nIf you are a workspace admin, this will stop all scheduled jobs in the workspace.\nInstructions\nUse the following sample code to stop all of your scheduled jobs in the workspace.\nInfo\nTo get your workspace URL, review\nWorkspace instance names, URLs, and IDs\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nGenerate a personal access token\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details on how to create a personal access token for use with the REST APIs.\nCopy and paste the sample code into a notebook cell.\nReplace the\nshard_url\nand\naccess_token\nvalues with ones specific to your workspace.\nRun the cell to stop all of your scheduled jobs in the workspace.\n%python\r\n\r\nimport requests\r\nimport json\r\n\r\nshard_url = \"\"\r\naccess_token = \"\"\r\n\r\n# Jobs API endpoints built from the workspace URL\r\njob_list_url = shard_url + \"/api/2.0/jobs/list\"\r\njob_update_url = shard_url + \"/api/2.0/jobs/update\"\r\n\r\nflag = 0\r\n\r\nheaders_auth = {\r\n 'Authorization': f'Bearer {access_token}'\r\n}\r\n\r\n\r\njobs_list = requests.request(\"GET\", job_list_url, headers=headers_auth).json()\r\n\r\n\r\nfor job in jobs_list['jobs']:\r\n    if \"schedule\" in job['settings']:\r\n        if job['settings']['schedule']['pause_status'] == \"UNPAUSED\":\r\n            flag += 1\r\n            schedule = job['settings']['schedule']\r\n            schedule['pause_status'] = \"PAUSED\"\r\n            job_name = job['settings']['name']\r\n            job_id = job['job_id']\r\n            \r\n            payload_pause_schedule = json.dumps({\r\n             \"job_id\": job['job_id'],\r\n             \"new_settings\": {\r\n             \"schedule\": schedule\r\n              }\r\n              })\r\n\r\n\r\n            response = requests.request(\"POST\", job_update_url, 
headers = headers_auth, data = payload_pause_schedule)\r\n            print(\"Pausing job \",job_id)\r\n\r\n\r\nif flag == 0:\r\n    print(\"No jobs to be paused\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/stream-to-stream-join-failure.json b/scraped_kb_articles/stream-to-stream-join-failure.json new file mode 100644 index 0000000000000000000000000000000000000000..af0ccb201fc907dd6ad68ceb71105c8390791b7a --- /dev/null +++ b/scraped_kb_articles/stream-to-stream-join-failure.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/stream-to-stream-join-failure", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are encountering an error when attempting to display a streaming DataFrame that is derived by performing a stream-stream join.\nCause\nWhen calling the\ndisplay\nmethod on a structured streaming DataFrame, the default settings utilize complete output mode and a memory sink. However, it's important to note that for stream-stream joins, the complete output mode is not supported. 
You can only use the append mode for a stream-stream join.\nSolution\nEnsure that the\nwriteStream\nmethod is used to load the data into the sink, and specifically select the append mode as the output mode.\nFor more information, refer to the Python example from the\nAppend mode\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n%python\r\n\r\n(events.writeStream\r\n .format(\"delta\")\r\n .outputMode(\"append\")\r\n .option(\"checkpointLocation\", \"/tmp/delta/_checkpoints/\")\r\n .start(\"/delta/events\")\r\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/stream-xml-auto-loader.json b/scraped_kb_articles/stream-xml-auto-loader.json new file mode 100644 index 0000000000000000000000000000000000000000..dd85ec90381fd73cb9037954aa492ded78aa519f --- /dev/null +++ b/scraped_kb_articles/stream-xml-auto-loader.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/stream-xml-auto-loader", + "title": "Título do Artigo Desconhecido", + "content": "Apache Spark does not include a streaming API for XML files. 
However, you can combine the auto-loader features of the Spark batch API with the OSS library, Spark-XML, to stream XML files.\nIn this article, we present a Scala-based solution that parses XML data using an auto-loader.\nInstall Spark-XML library\nYou must install the\nSpark-XML\nOSS library on your Databricks cluster.\nReview the install a library on a cluster (\nAWS\n|\nAzure\n) documentation for more details.\nInfo\nYou must ensure that the version of Spark-XML you are installing matches the version of Spark on your cluster.\nCreate the XML file\nCreate the XML file and use DBUtils (\nAWS\n|\nAzure\n) to save it to your cluster.\n%scala\r\n\r\nval xml2=\"\"\"\r\n  \r\n    25\r\n  \r\n  \r\n    30\r\n  \r\n  \r\n    30\r\n  \r\n\"\"\"\r\n\r\ndbutils.fs.put(\"//.xml\",xml2)\nDefine imports\nImport the required functions.\n%scala\r\n\r\nimport com.databricks.spark.xml.functions.from_xml\r\nimport com.databricks.spark.xml.schema_of_xml\r\nimport spark.implicits._\r\nimport com.databricks.spark.xml._\r\nimport org.apache.spark.sql.functions.{udf, input_file_name}\nDefine a UDF to convert binary to string\nThe streaming DataFrame requires data to be in string format.\nYou should define a user-defined function to convert binary data to string data.\n%scala\r\n\r\n\r\nval toStrUDF = udf((bytes: Array[Byte]) => new String(bytes, \"UTF-8\"))\nExtract XML schema\nYou must extract the XML schema before you can implement the streaming DataFrame.\nThis can be inferred from the file using the\nschema_of_xml\nmethod from Spark-XML.\nThe XML string is passed as input, from the binary Spark data.\n%scala\r\n\r\nval df_schema = spark.read.format(\"binaryFile\").load(\"/FileStore/tables/test/xml/data/age/\").select(toStrUDF($\"content\").alias(\"text\"))\r\n\r\nval payloadSchema = schema_of_xml(df_schema.select(\"text\").as[String])\nImplement the stream reader\nAt this point, all of the required dependencies have been met, so you can implement the stream reader.\nUse readStream with binary 
and autoLoader listing mode options enabled.\nInfo\nListing mode is used when working with small amounts of data. You can leverage\nfileNotificationMode\nif you need to scale up your application.\ntoStrUDF\nis used to convert binary data to string format (text).\nfrom_xml\nis used to convert the string to a complex struct type, with the user-defined schema.\n%scala\r\n\r\nval df = spark.readStream.format(\"cloudFiles\")\r\n  .option(\"cloudFiles.useNotifications\", \"false\") // Using listing mode, hence false is used\r\n  .option(\"cloudFiles.format\", \"binaryFile\")\r\n  .load(\"/FileStore/tables/test/xml/data/age/\")\r\n  .select(toStrUDF($\"content\").alias(\"text\")) // UDF to convert the binary to string\r\n  .select(from_xml($\"text\", payloadSchema).alias(\"parsed\")) // Function to convert string to complex types\r\n  .withColumn(\"path\",input_file_name) // input_file_name is used to extract the paths of input files\nView output\nOnce everything is set up, view the output of\ndisplay(df)\nin a notebook.\nExample notebook\nThis example notebook combines all of the steps into a single, functioning example.\nImport it into your cluster to run the examples.\nStreaming XML example notebook\nReview the\nStreaming XML example notebook\n." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/streaming-application-missing-data-from-a-delta-table-when-writing-to-a-given-destination.json b/scraped_kb_articles/streaming-application-missing-data-from-a-delta-table-when-writing-to-a-given-destination.json new file mode 100644 index 0000000000000000000000000000000000000000..7008c2a3a8eb68178bdf3bfff3f1c0bc65253718 --- /dev/null +++ b/scraped_kb_articles/streaming-application-missing-data-from-a-delta-table-when-writing-to-a-given-destination.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/streaming-application-missing-data-from-a-delta-table-when-writing-to-a-given-destination", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using a streaming application to stream data from a Delta table and write to a given destination, you notice data loss.\nCause\nIn trying to separately address a failed streaming job by using\nstartingVersion=latest\n, the tradeoff is possible data loss. The restarted query will read only from the latest available Delta version of the source table, ignoring all versions before the query failed.\nExample\nIf the source Delta table has versions [0,1,2,3,4,5] and a streaming query fails while processing version 3, restarting with a new checkpoint folder and\nstartingVersion=latest\nwill cause the query to process from version 6 onwards, without going back to process versions [3,4,5].\nSolution\nRestart a streaming query on a new checkpoint folder.\nIdentify the version of the source table that was last processed by the streaming job. You can get this information either from the old checkpoint offset folder or from the last microbatch metrics. Below are the sample microbatch metrics where\nreservoirVersion\nrepresents the Delta version that the batch read.\n
\"endOffset\" : {\r\n      \"sourceVersion\" : 1,\r\n      \"reservoirId\" : \"aaaaa-aa-aa-aaaa\",\r\n      \"reservoirVersion\" : X,\r\n      \"index\" : -1,\r\n      \"isStartingVersion\" : false\r\n    },
\nStart the streaming job on the new checkpoint folder with the\nstartingVersion\noption pointing to the next Delta version (X+1).\n
spark.readStream.format(\"delta\").option(\"startingVersion\", \"X+1\").load(\"\")\nIf the use case allows, restart the stream without the\nstartingVersion\noption to process a complete snapshot of the source table, which is more suitable for merge workloads."
+}
\ No newline at end of file
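The restart arithmetic from the article above (the last processed `reservoirVersion` X, restart at X+1) can be checked in isolation. This is a minimal sketch: `reservoirVersion` 5 is an illustrative value standing in for X, not from the article.

```python
import json

# endOffset JSON as captured from the last successful micro-batch metrics.
# reservoirVersion=5 is a made-up example standing in for X.
end_offset = json.loads("""
{
  "sourceVersion": 1,
  "reservoirId": "aaaaa-aa-aa-aaaa",
  "reservoirVersion": 5,
  "index": -1,
  "isStartingVersion": false
}
""")

# Restart from the version after the last one the failed stream read (X + 1).
starting_version = end_offset["reservoirVersion"] + 1
print(starting_version)
```

The resulting value is what you would pass as the `startingVersion` option when restarting on the new checkpoint folder.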
diff --git a/scraped_kb_articles/streaming-job-failing-with-error-orgrocksdbrocksdbexception-too-many-open-files.json b/scraped_kb_articles/streaming-job-failing-with-error-orgrocksdbrocksdbexception-too-many-open-files.json
new file mode 100644
index 0000000000000000000000000000000000000000..8dc13e75072062107b4277a3ec0326fd52973b90
--- /dev/null
+++ b/scraped_kb_articles/streaming-job-failing-with-error-orgrocksdbrocksdbexception-too-many-open-files.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/streaming-job-failing-with-error-orgrocksdbrocksdbexception-too-many-open-files",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYour streaming job using Auto Loader fails with one of the following errors.\nError type 1\norg.rocksdb.RocksDBException: While open a file for random read: /local_disk0/tmp/spark-.sst: Too many open files\nError type 2\norg.rocksdb.RocksDBException: While opendir: /local_disk0/tmp/spark-: Too many open files\nError type 3\norg.rocksdb.RocksDBException: While open a file for appending: /local_disk0/tmp/spark-/MANIFEST-xxxx: Too many open files\nCause\nThe operating system has a file descriptor limit. When multiple Auto Loader streams are concurrently active on a single cluster, they put cumulative pressure on that limit.\nContext\nStructured streaming uses a\nRocksDB\nengine in features to manage stream progress, such as Auto Loader and state store for stateful streaming. RocksDB persists versioning state by writing key-value pairs into immutable SST files stored on disk, typically at\n/local_disk0/tmp/spark-\n.\nAs streaming data flows in, these SST files are frequently created, compacted, and rewritten. Under heavy streaming workloads, especially when multiple streams are active, the volume of SST files rapidly grows.\nThis growth results in the operating system exhausting the limit on simultaneously open file descriptors (\nulimit\n), particularly when the count of unique open SST files across all RocksDB instances on the node approaches or exceeds the default hard limit of 1 million files per root user.\nSolution\nMake changes to Apache Spark configurations applicable to Auto Loader, state store, or both. If necessary, also consider splitting a large number of streams across multiple job clusters, which will reduce the overhead of the number of open files across multiple streams.\nAuto Loader configuration change\nUse the\nreadStream\noption\ncloudFiles.maxFileAge\nto decrease the retention period of discovered files. 
Decreasing the retention period lowers the SST file count, reducing the RocksDB storage footprint per Auto Loader stream.\nFor more information, refer to the “Event retention” section of the\nConfigure Auto Loader for production workloads\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nState store configuration changes\nIn a notebook, or in your cluster settings under\nAdvanced options > Spark\nin the\nSpark config\nfield:\nSet the\nspark.sql.streaming.stateStore.rocksdb.compactOnCommit\nto\ntrue\nto enable performing a range compaction of RocksDB instance for commit operation.\nSet the\nspark.sql.streaming.stateStore.rocksdb.maxOpenFiles\nto\n10000\nor\n20000\n. This configuration is used to control the maximum number of open files allowed by RocksDB.\nAuto Loader and state store configuration change\nIn a notebook, or in your cluster settings under\nAdvanced options > Spark\nin the\nSpark config\nfield, lower the configuration\nspark.databricks.rocksDB.fileManager.compactSmallFilesThreshold\nto\n8\n. (The default is\n16\n).\nThis configuration controls how many\n<1MB\nfiles can be in any level of a RocksDB-based checkpoint before a compaction will be triggered. Reducing it will trigger compaction more frequently and reduce the number of SST files."
+}
\ No newline at end of file
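The configuration changes described in the article above can be collected in one place. A minimal sketch: the values shown (10000 open files, threshold 8) are the article's suggestions, not universal defaults, so tune them for your workload.

```python
# Spark configuration changes suggested in the article above; the exact
# values are the article's suggestions, not universal defaults.
ROCKSDB_CONFS = {
    # Compact the RocksDB instance on every commit.
    "spark.sql.streaming.stateStore.rocksdb.compactOnCommit": "true",
    # Cap the files RocksDB may hold open (article suggests 10000 or 20000).
    "spark.sql.streaming.stateStore.rocksdb.maxOpenFiles": "10000",
    # Compact <1MB SST files sooner than the default threshold of 16.
    "spark.databricks.rocksDB.fileManager.compactSmallFilesThreshold": "8",
}

def apply_confs(conf_setter, confs=ROCKSDB_CONFS):
    """Apply each key/value through a setter such as spark.conf.set."""
    for key, value in confs.items():
        conf_setter(key, value)
    return confs
```

In a notebook you would call `apply_confs(spark.conf.set)`; in cluster settings, add the same keys under Advanced options > Spark.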
diff --git a/scraped_kb_articles/streaming-job-failing-with-job-terminated-with-exception-error.json b/scraped_kb_articles/streaming-job-failing-with-job-terminated-with-exception-error.json
new file mode 100644
index 0000000000000000000000000000000000000000..f7d2e4633851609fe165271f0363afc29c0a0169
--- /dev/null
+++ b/scraped_kb_articles/streaming-job-failing-with-job-terminated-with-exception-error.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/streaming-job-failing-with-job-terminated-with-exception-error",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou start a streaming job from a data source using Structured Streaming (or Delta Live Tables). You then modify the query to include an additional streaming source using\nUNION\n.\nYour job fails with the following error.\nJob terminated with exception: assertion failed: There are [X] sources in the checkpoint offsets, and now there are [X+Y] sources requested by the query. Cannot continue. SQLSTATE: XXKST\nCause\nThe number of streaming sources in the query and the number of sources recorded in the existing checkpoint metadata do not match.\nWhen you create a stream with a single source, a checkpoint is established for this source. Structured Streaming (including Delta Live Tables) uses checkpointing to track progress. Each source's metadata (for example, file offsets or Kafka partitions) is stored in the checkpoint.\nIf you modify the query to add another streaming source after starting the job, the query now has two sources, but the initial checkpoint only knows about the first source.\nApache Spark detects the inconsistency and throws the\nJob terminated\nerror. Spark does not allow the number or type of sources to change across restarts when using the same checkpoint.\nSolution\nFor Spark Streaming, start with a fresh checkpoint when you need to make changes. If you are using a Delta Live Table (DLT) pipeline, perform a full refresh of the pipeline.\nFor more information, refer to the\nRun an update in\nLakeflow Declarative Pipelines\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
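The fresh-checkpoint solution can be sketched in PySpark. This is a hypothetical sketch: the table names and the checkpoint path are placeholders, and `spark` is the SparkSession a Databricks notebook provides automatically.

```python
# Hypothetical sketch -- table names and paths are placeholders.
stream_a = spark.readStream.table("source_table_a")   # original source
stream_b = spark.readStream.table("source_table_b")   # source added later

query = (
    stream_a.union(stream_b)
    .writeStream
    # Fresh location: do not reuse the checkpoint written by the
    # single-source version of this query, or the source-count
    # assertion above will fire again.
    .option("checkpointLocation", "/Volumes/main/default/chk/union_v2")
    .toTable("target_table")
)
```

Starting from a new checkpoint means the unioned query reprocesses from the sources' initial positions, so plan for idempotent or deduplicating sinks.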
diff --git a/scraped_kb_articles/streaming-job-not-sending-data-to-eventhub-and-failing-with-timeoutexception-message.json b/scraped_kb_articles/streaming-job-not-sending-data-to-eventhub-and-failing-with-timeoutexception-message.json
new file mode 100644
index 0000000000000000000000000000000000000000..27200623940d39e4d403d1cf3ac227ca82e67b24
--- /dev/null
+++ b/scraped_kb_articles/streaming-job-not-sending-data-to-eventhub-and-failing-with-timeoutexception-message.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/streaming-job-not-sending-data-to-eventhub-and-failing-with-timeoutexception-message",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou’re attempting to write data from a Delta table to an Event Hub in a streaming job. The job fails without sending any data to the Event Hub and your event log shows multiple timeout messages.\nCaused by: com.microsoft.azure.eventhubs.TimeoutException: Entity(XXXXXX): Send operation timed out\nCause\nThe\nmaxBytesPerTrigger\noption, which controls the batch size in bytes, is not set by default. Without this option, Apache Spark uses the default\nmaxFilesPerTrigger\n, which is set to\n1000\n.\nSolution\nSet\nmaxBytesPerTrigger\naccording to your Event Hub tier rate limit. For details on limits, review the\nAzure Event Hubs quotas and limits\ndocumentation.\nNote\nIf you continue using\nmaxFilesPerTrigger\nalong with\nmaxBytesPerTrigger\n, your job respects whichever limit is reached first.\nFor more information, review the\nConfigure Structured Streaming batch size on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
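The `maxBytesPerTrigger` setting can be sketched on the read side of the stream. A hypothetical sketch: the table name and the `512m` cap are placeholders (size the cap to your Event Hub tier's ingress limit), and `spark` is the notebook's SparkSession; the Event Hubs write side is unchanged.

```python
# Hypothetical sketch: cap each micro-batch by bytes instead of relying
# on the default maxFilesPerTrigger of 1000 files.
df = (
    spark.readStream
    .format("delta")
    # Placeholder value -- choose it from your Event Hub tier rate limit.
    .option("maxBytesPerTrigger", "512m")
    .table("source_table")
)
```

If `maxFilesPerTrigger` is also set, the batch closes at whichever limit is reached first.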
diff --git a/scraped_kb_articles/streaming-job-stuck-writing-checkpoint.json b/scraped_kb_articles/streaming-job-stuck-writing-checkpoint.json
new file mode 100644
index 0000000000000000000000000000000000000000..065c0b03876df50cfc7609bf7bc148419f03a705
--- /dev/null
+++ b/scraped_kb_articles/streaming-job-stuck-writing-checkpoint.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/streaming-job-stuck-writing-checkpoint",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are monitoring a streaming job, and notice that it appears to get stuck when processing data.\nWhen you review the logs, you discover the job gets stuck when writing data to a checkpoint.\nINFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/89]:\r\nINFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80\r\nINFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=37),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/37]:\r\nINFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80\r\nINFO HDFSBackedStateStoreProvider: Deleted files older than 313920 for HDFSStateStoreProvider[id = (op=0,part=25),dir = dbfs:/FileStore/PYTHON_CHECKPOINT5/state/0/25]:\nCause\nYou are trying to use a checkpoint location in your local DBFS path.\n%python\r\n\r\nquery = streamingInput.writeStream.option(\"checkpointLocation\", \"/FileStore/checkpoint\").start()\nSolution\nUse persistent cloud storage for streaming checkpoints. Do not use DBFS for streaming checkpoint storage."
+}
\ No newline at end of file
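The fix amounts to pointing `checkpointLocation` at durable object storage instead of the DBFS root. A hypothetical sketch: the bucket and paths are placeholders, and `streamingInput` comes from the surrounding job as in the article's snippet.

```python
# Hypothetical sketch: checkpoint to durable cloud object storage, not
# /FileStore on the local DBFS root. Bucket and paths are placeholders.
query = (
    streamingInput.writeStream
    .format("delta")
    # Persistent object-store path survives cluster termination.
    .option("checkpointLocation", "s3://my-bucket/checkpoints/my_stream")
    .start("s3://my-bucket/tables/my_stream")
)
```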
diff --git a/scraped_kb_articles/streamlit-app-deployed-as-databricks-app-failing-with-java_gateway_exited-error.json b/scraped_kb_articles/streamlit-app-deployed-as-databricks-app-failing-with-java_gateway_exited-error.json
new file mode 100644
index 0000000000000000000000000000000000000000..8e74ab5ee33f6bed2fa4cda3ed085dd89fcac6c3
--- /dev/null
+++ b/scraped_kb_articles/streamlit-app-deployed-as-databricks-app-failing-with-java_gateway_exited-error.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/streamlit-app-deployed-as-databricks-app-failing-with-java_gateway_exited-error",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYour Streamlit application deployed as a Databricks App contains a call such as the following.\nspark = SparkSession.builder.appName(\"QuestionAnswer\").getOrCreate()\nThe app crashes on startup and the container throws the following error.\nPySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number\nCause\nDatabricks Apps are lightweight, container-based runtimes designed for UI rendering and light orchestration. They do not ship with an Apache Spark driver, executor, or JVM.\nAny call that instantiates a SparkSession (or lower-level SparkContext) tries to start the Java gateway and fails, producing the\n[JAVA_GATEWAY_EXITED]\nerror.\nSolution\nDatabricks Apps should delegate compute to an existing Databricks cluster or to Databricks SQL instead of attempting to create Spark locally.\nReplace the direct SparkSession instantiation with one of the supported remote-connection SDKs or drivers, such as:\nDatabricks SDK for Python\n(\nAWS\n|\nAzure\n|\nGCP\n)\nDatabricks SQL Connector for Python\n(\nAWS\n|\nAzure\n|\nGCP\n)\nSQLAlchemy\n(\nAWS\n|\nAzure\n|\nGCP\n)."
+}
\ No newline at end of file
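A sketch of the SQL Connector route, one of the options listed above. The environment variable names, warehouse details, and table name are placeholders; the app is assumed to read its credentials from the environment rather than hard-coding them.

```python
# Hypothetical sketch: replace SparkSession.getOrCreate() in a Databricks
# App with the Databricks SQL Connector for Python, which delegates the
# query to a SQL warehouse. Env var names and the table are placeholders.
import os
from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT question, answer FROM main.default.qa_table LIMIT 10")
        rows = cursor.fetchall()  # list of Row objects for the Streamlit UI
```

No JVM starts inside the app container; the warehouse does the compute.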
diff --git a/scraped_kb_articles/string-aggregation-queries-failing-with-data-type-mismatch-error-on-serverless-compute.json b/scraped_kb_articles/string-aggregation-queries-failing-with-data-type-mismatch-error-on-serverless-compute.json
new file mode 100644
index 0000000000000000000000000000000000000000..34aa4542f5e84f311b7b04ae67f27c068927b5b5
--- /dev/null
+++ b/scraped_kb_articles/string-aggregation-queries-failing-with-data-type-mismatch-error-on-serverless-compute.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/jobs/string-aggregation-queries-failing-with-data-type-mismatch-error-on-serverless-compute",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen running string aggregation queries on a serverless compute, you receive the following error message.\n[DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE] `Cannot resolve '(col1 + col2)' due to data type mismatch: the binary operator requires the input type ('NUMERIC' or 'INTERVAL DAY TO SECOND' or 'INTERVAL YEAR TO MONTH' or 'INTERVAL'), not 'STRING'.`\nYou notice the same queries work in an all-purpose compute.\nCause\nThe ANSI_MODE setting behaves differently on a serverless compute than on an all-purpose compute. On a serverless compute, ANSI_MODE is enabled by default, enforcing stricter data type rules and preventing implicit cross-casting.\nOn an all-purpose compute, ANSI_MODE is disabled by default, allowing for implicit cross-casting of data types during query execution.\nSolution\nThere are two options available to resolve the issue. Consider which best applies to your use case.\nExplicitly cast the string columns to the desired numeric data type before performing arithmetic operations. The following code provides an example.\n%sql \r\nSELECT SUM(CAST(col1 AS DOUBLE) + CAST(col2 AS DOUBLE)) FROM \nAlternatively, run the following code (either SQL or Python) in your notebook to disable ANSI_MODE on the serverless compute for the duration of the session.\n%sql\r\nset ansi_mode = False\r\n\r\n%python\r\nspark.conf.set(\"spark.sql.ansi.enabled\", False)\nThen rerun your query on the table.\n%sql\r\nselect sum(col1 + col2) from \nImportant\nIf data overflows when ANSI_MODE is disabled, the job may fail with an overflow error.\nFor more information, refer to the\nANSI_MODE\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/structured-streaming-does-not-process-batch-size-reduction-after-a-failed-transaction.json b/scraped_kb_articles/structured-streaming-does-not-process-batch-size-reduction-after-a-failed-transaction.json
new file mode 100644
index 0000000000000000000000000000000000000000..448673c7d5f9c1e2cbed07f961dbbb3d600c11fa
--- /dev/null
+++ b/scraped_kb_articles/structured-streaming-does-not-process-batch-size-reduction-after-a-failed-transaction.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/structured-streaming-does-not-process-batch-size-reduction-after-a-failed-transaction",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are attempting to adjust the\nmaxFilesPerTrigger\nor\nmaxBytesPerTrigger\nsettings to control the amount of data processed in each Structured Streaming micro-batch. However, after a failed transaction, these changes do not take effect. The stream continues to use the previous batch settings, ignoring any new configurations.\nCause\nWhen a micro-batch fails, the system has already created and stored offset information in the checkpoint directory. These offset files are not automatically overwritten when the stream restarts or when configurations are changed.\nThe problem occurs because the offset files created during a failed micro-batch remain in the checkpoint directory. Changes to configurations like\nmaxFilesPerTrigger\nor\nmaxBytesPerTrigger\nare only applied to new offset calculations. The stream continues to use the existing offset information from the failed batch, ignoring the updated configurations.\nThis behavior leads to a situation where the stream doesn't immediately adapt to the new rate limiting settings, instead still using the offset information from the previous, failed attempt. 
The new configurations only take effect for completely new micro-batches, not for retries of failed batches.\nSolution\nRun the command\n%fs ls \nin a notebook to list the files in the table checkpoint location.\nIdentify the offset file that does not have a corresponding commits file in the checkpoint location.\nLocate the latest file inside the offset folder and obtain its name.\nEnsure there is no corresponding file with that name in the commits file folder.\nMake a backup of this offset file and store it in an external location so you can return to a known state if needed.\nDelete the offset file from its original location in the checkpoint path.\nRestart the stream.\nThis process ensures that the stream picks up the new configuration and resolves the batch size reduction issue.\nAdditional considerations\nMonitor the checkpoint directory for any discrepancies between the offset and commit files.\nImplement automated scripts to check for and resolve such mismatches proactively."
+}
\ No newline at end of file
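The manual inspection steps above (find the offset file with no matching commit) can be sketched as pure Python over the two folders' file names. A minimal sketch: on Databricks you would feed it the names returned by `dbutils.fs.ls()` on `<checkpoint>/offsets` and `<checkpoint>/commits`.

```python
def uncommitted_offsets(offset_files, commit_files):
    """Return offset batch IDs that have no matching commit file.

    Offset and commit files are named by batch number ("0", "1", ...);
    helper files such as ".0.crc" are ignored. The IDs returned are the
    candidates to back up and then delete before restarting the stream.
    """
    offsets = {name for name in offset_files if name.isdigit()}
    commits = {name for name in commit_files if name.isdigit()}
    return sorted(offsets - commits, key=int)
```

Always back up the identified offset file to an external location before deleting it, as the article advises.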
diff --git a/scraped_kb_articles/structured-streaming-job-fails-with-a-streaming-query-exception-when-a-schema-changes-in-the-source-table.json b/scraped_kb_articles/structured-streaming-job-fails-with-a-streaming-query-exception-when-a-schema-changes-in-the-source-table.json
new file mode 100644
index 0000000000000000000000000000000000000000..6976a58a01e1666e66bf9fa6607e7923d45be6b3
--- /dev/null
+++ b/scraped_kb_articles/structured-streaming-job-fails-with-a-streaming-query-exception-when-a-schema-changes-in-the-source-table.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/structured-streaming-job-fails-with-a-streaming-query-exception-when-a-schema-changes-in-the-source-table",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou have a streaming job that is ingesting data from a Delta table. Some columns in the Delta table may have been renamed or dropped (schema evolution) and you get a\nStreamingQueryException: [STREAM_FAILED]\nerror message.\nStreamingQueryException: [STREAM_FAILED] Query [id = XXX, runId = XXXX] terminated with exception: The schema, table configuration or protocol of your Delta table has changed during streaming.The schema or metadata tracking log has been updated.Please restart the stream to continue processing using the updated metadata.\nCause\nIf you add, drop, or rename any column in the source table, the streaming job fails.\nSolution\nUpdate the required schema definition either at source or target and restart the streaming query to continue processing the job.\nFor non-additive schema changes such as renaming or dropping columns, enable schema tracking. For the scenario to work, the schema must be specified, and each streaming read against a data source must have its own\nschemaTrackingLocation\nspecified. For more information, review the\nRename and drop columns with Delta Lake column mapping\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation. This ensures that schema changes are properly tracked.\nSet\nspark.databricks.delta.streaming.allowSourceColumnRenameAndDrop\nto true.\nRestart the streaming query.\nNote\nThis is supported in Databricks Runtime 13.3 LTS and above. If your workflow has non-additive schema changes such as renaming or dropping columns, this configuration is a good choice. Otherwise this configuration is not needed."
+}
\ No newline at end of file
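The two solution steps above can be sketched together in PySpark. A minimal sketch, assuming a Databricks notebook where `spark` already exists; the table name and tracking path are placeholders.

```python
# Allow non-additive schema changes (rename/drop) across the restart.
# Supported on Databricks Runtime 13.3 LTS and above.
spark.conf.set(
    "spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "true"
)

df = (
    spark.readStream
    .format("delta")
    # Each streaming read needs its own schema tracking location,
    # typically a subdirectory of that query's checkpoint directory.
    .option("schemaTrackingLocation", "/Volumes/main/default/chk/orders/_schema")
    .table("source_table")  # placeholder source table
)
```

After setting both, restart the streaming query so it picks up the tracked schema.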
diff --git a/scraped_kb_articles/structured-streaming-jobs-slow-down-on-every-10th-batch.json b/scraped_kb_articles/structured-streaming-jobs-slow-down-on-every-10th-batch.json
new file mode 100644
index 0000000000000000000000000000000000000000..5dfb88e74079fbbb97beb9cc7caefaf2d32c3b6a
--- /dev/null
+++ b/scraped_kb_articles/structured-streaming-jobs-slow-down-on-every-10th-batch.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/structured-streaming-jobs-slow-down-on-every-10th-batch",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are running a series of structured streaming jobs and writing to a file sink. Every 10th run appears to run slower than the previous jobs.\nCause\nThe file sink creates a\n_spark_metadata\nfolder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to provide an exactly-once guarantee for file sink streaming. By default, on every 10th batch, the previous nine batch data files are compacted into a single file at\n//data/_spark_metadata/9.compact\n.\nSolution\nThere are three possible solutions. Choose the one that is most appropriate for your situation.\nOption 1: Mitigates the issue in a production environment, with minimal code changes, but retains less metadata.\nOption 2: Recommended if you can switch to using Delta tables. This is a good long-term solution.\nOption 3: Recommended if the pipeline doesn't require exactly-once semantics or downstream can handle duplicates.\nOption 1: Shorten metadata retention time\nThe metadata folder grows larger over time by default. To mitigate this, you can set a maximum retention time for the output files. Files older than the retention period are automatically excluded, which limits the number of files in the metadata folder. Fewer files in the metadata folder means compaction takes less time.\nSet the retention period when you write the streaming DataFrame to your file sink:\n%python\r\n\r\ncheck_point = ''\r\ntarget_path = ''\r\nretention = '' # You can provide the value as string format of the time in hours or days. For example, \"12h\", \"7d\", etc. This value is disabled by default\r\n\r\ndf.writeStream.format('json').outputMode('append').option('checkpointLocation', check_point).option('path', target_path).option('retention', retention).start()\nInfo\nRetention defines the time to live (TTL) for output files. Output files committed before the TTL range are excluded from the metadata log. 
Attempts to read the sink's output directory will not process any files older than the TTL range.\nOption 2: Use a Delta table as the sink\nDelta tables do not use a\n_spark_metadata\nfolder and they provide exactly-once semantics.\nFor more information, please review the documentation on using a Delta table as a sink (\nAWS\n|\nAzure\n|\nGCP\n).\nOption 3: Use\nforeachBatch\nforeachBatch\ndoes not create a\n_spark_metadata\nfolder when writing to the sink.\nWarning\nExactly-once semantics are not supported with\nforeachBatch\n. Only use\nforeachBatch\nif you are certain that your application does not require exactly-once semantics.\nThis warning can be disregarded if you are writing to a Delta table."
+}
\ No newline at end of file
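Option 3 can be sketched in PySpark. A hypothetical sketch: the paths are placeholders and `df` is the streaming DataFrame from the surrounding job.

```python
# Hypothetical sketch of Option 3: foreachBatch writes each micro-batch
# with a plain batch writer, so no _spark_metadata folder is created at
# the sink. Paths are placeholders.
def write_batch(batch_df, batch_id):
    # Plain batch write; duplicates are possible on retries unless the
    # target (for example, a Delta table) handles idempotency.
    batch_df.write.mode("append").json("s3://my-bucket/output/")

query = (
    df.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3://my-bucket/chk/")
    .start()
)
```

The checkpoint still tracks stream progress; only the per-batch file metadata (and its 10th-batch compaction) goes away.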
diff --git a/scraped_kb_articles/structured-streaming-workflow-reading-data-from-cdc-is-failing.json b/scraped_kb_articles/structured-streaming-workflow-reading-data-from-cdc-is-failing.json
new file mode 100644
index 0000000000000000000000000000000000000000..db964c0880505c4de560788a5aeb1ac304b9fbb2
--- /dev/null
+++ b/scraped_kb_articles/structured-streaming-workflow-reading-data-from-cdc-is-failing.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/structured-streaming-workflow-reading-data-from-cdc-is-failing",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen running a structured streaming job that reads data from a change data capture (CDC) table and joins it with another streaming table, you notice the job initially runs successfully for a few batches but then fails with the following error.\n[STREAM_FAILED] Query [id = XXX, runId = XXX] terminated with exception: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 953) (100.XX.YY.112 executor 11): org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field. - Provided key schema: StructType(StructField(field0,StringType,true),StructField(index,LongType,true)) - Existing key schema: StructType(StructField(field0,StringType,true),StructField(index,LongType,true)) - Provided value schema: StructType(StructField(c_name_1,StringType,true),StructField(c_name_2,StringType,true),StructField(c_name_3,StringType,true),StructField(c_name_4,StringType,true),StructField(c_name_5,StringType,true),StructField(c_name_6,StringType,true),StructField(c_name_7,StringType,true)) - Existing value schema: StructType(StructField(c_name_1,StringType,false),StructField(c_name_2,StringType,true),StructField(c_name_3,StringType,true),StructField(c_name_4,StringType,true),StructField(c_name_5,StringType,true),StructField(c_name_6,StringType,true),StructField(c_name_7,StringType,true)) If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false. Please note running query with incompatible schema could cause indeterministic behavior.\nCause\nThe provided schema does not match the existing schema for the state. 
This issue can occur due to the schema of the data being read from the CDC table changing over time, causing a schema mismatch with the existing state.\nSpecifically, there is a change in the nullability of the\n\"c_name_1\"\nfield. It was once\n`StructField(c_name_1,StringType,false)`\nand changed to\n`StructField(c_name_1,StringType,true)`\n.\nSolution\nSet an Apache Spark configuration to avoid the schema mismatch issues caused by the change in nullable value. When set to true, the state schema checker does not check the nullability of columns.\nNavigate to your cluster and open the settings.\nClick\nAdvanced options\n.\nUnder the\nSpark\ntab, in the\nSpark config\nbox, enter the following code.\nspark.databricks.streaming.stateStore.stateSchemaCheck.ignoreNullCompatibility true\nAlternatively, add the previous code in your notebook using the\nspark.conf.set()\ncommand.\nNote\nIf the input schema previously marked a column as nullable but later changes the column to non-nullable, stateful operators may read older rows from the state store where the column's value is null. This may result in downstream output containing null values for a column that is expected to be non-nullable."
+}
\ No newline at end of file
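The notebook-level alternative mentioned above is a one-line configuration sketch (`spark` is the notebook's SparkSession):

```python
# Skip nullability comparison in the state schema check, per the article.
spark.conf.set(
    "spark.databricks.streaming.stateStore.stateSchemaCheck.ignoreNullCompatibility",
    "true",
)
```

Set this before starting the streaming query, and keep the article's note about null values in mind when relaxing the check.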
diff --git a/scraped_kb_articles/symlink-format-manifest-fails-when-trying-to-enable-liquid-clustering-on-a-table.json b/scraped_kb_articles/symlink-format-manifest-fails-when-trying-to-enable-liquid-clustering-on-a-table.json
new file mode 100644
index 0000000000000000000000000000000000000000..54bc709c5202bb4b70743958a82f10841f768c62
--- /dev/null
+++ b/scraped_kb_articles/symlink-format-manifest-fails-when-trying-to-enable-liquid-clustering-on-a-table.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/symlink-format-manifest-fails-when-trying-to-enable-liquid-clustering-on-a-table",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen you try to enable liquid clustering on a table using the Symlink format manifest, it fails with the following error.\nDELTA_VIOLATE_TABLE_PROPERTY_VALIDATION_FAILED.PERSISTENT_DELETION_VECTORS_WITH_INCREMENTAL_MANIFEST_GENERATION.\nCause\nWhen enabling liquid clustering, deletion vectors are enabled by default. Deletion vectors and Symlink format manifest cannot be used together when using liquid clustering, so the Symlink format manifest fails.\nContext\nSymlink format manifest allows querying a Delta table with external engines like Presto or Athena.\nDeletion vectors enhance Delta table performance by optimizing update operations.\nSolution\nDisable deletion vectors or avoid using Symlink format manifest, as appropriate for your context.\nIf you need to query your tables using external engines, you need to use Symlink format manifest. Disable deletion vectors for your table.\nALTER TABLE ..\r\nSET TBLPROPERTIES ('delta.enableDeletionVectors' = false);\nOtherwise, disable Symlink format manifest instead. Deletion vectors improve performance, so they are often the stronger choice.\nALTER TABLE .. \r\nSET TBLPROPERTIES ('delta.enableSymlinkManifestGeneration' = false);\nFor more information, review the\nUse liquid clustering for Delta tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/sync-command-fails-with-a-mismatched-input-error.json b/scraped_kb_articles/sync-command-fails-with-a-mismatched-input-error.json
new file mode 100644
index 0000000000000000000000000000000000000000..27a16ca498d09326686ea2f77fc43626db5a5584
--- /dev/null
+++ b/scraped_kb_articles/sync-command-fails-with-a-mismatched-input-error.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/unity-catalog/sync-command-fails-with-a-mismatched-input-error",
+    "title": "Unknown Article Title",
+    "content": "Problem\nThe\nSYNC\ncommand can be used to migrate legacy external Apache Hive tables to Unity Catalog.\nWhen you run\nSYNC\nin a Databricks notebook, it fails with a\nmismatched input 'schema' expecting 'MATERIALIZED'\nerror.\nSYNC schema . from hive_metastore. DRY RUN\ncom.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'schema' expecting 'MATERIALIZED'(line 1, pos 5)\nCause\nThis error occurs if Unity Catalog is not enabled on your cluster.\nSolution\nVerify that Unity Catalog is enabled in your workspace and on your clusters.\nUnity Catalog is supported with\nSingle user\nand\nShared\naccess mode on clusters running Databricks Runtime 11.3 LTS and above. Unity Catalog is not supported on\nNo isolation shared\nclusters."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/sync-fails-with-%5Bupgrade_not_supportedhive_serde%5D-table-is-not-eligible-for-upgrade-from-hive-metastore-to-unity-catalog.json b/scraped_kb_articles/sync-fails-with-%5Bupgrade_not_supportedhive_serde%5D-table-is-not-eligible-for-upgrade-from-hive-metastore-to-unity-catalog.json
new file mode 100644
index 0000000000000000000000000000000000000000..9f8328b5ca436b19f6d68500097fdd2a3e9e8ad7
--- /dev/null
+++ b/scraped_kb_articles/sync-fails-with-%5Bupgrade_not_supportedhive_serde%5D-table-is-not-eligible-for-upgrade-from-hive-metastore-to-unity-catalog.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/data/sync-fails-with-%5Bupgrade_not_supportedhive_serde%5D-table-is-not-eligible-for-upgrade-from-hive-metastore-to-unity-catalog",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhile trying to upgrade a table from Hive metastore to Unity Catalog, you encounter the following error.\n[UPGRADE_NOT_SUPPORTED.HIVE_SERDE] Table is not eligible for upgrade from Hive Metastore to Unity Catalog. Reason: Hive SerDe table. SQLSTATE: 0AKUC.\nCause\nThe error occurs because the Unity Catalog\nSYNC\ncommand cannot process tables created using the Hive SerDe format in the\nhive_metastore\ncatalog.\nThis issue arises during the\nSYNC\nprocess and affects tables located in the cloud storage. The table was previously accessible in Databricks using the\nhive_metastore\ncatalog but now encounters obstacles during migration to UC.\nSolution\n1. Convert your Hive SerDe tables to Delta format.\nCONVERT TO DELTA hive_metastore..;\n2. Issue the\nSYNC\ncommand to upgrade the tables to Unity Catalog.\nSYNC hive_metastore.. TO unity_catalog..;\nFor more information, please refer to the\nSYNC\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/syntax-error-when-running-vacuum-with-using-inventory-command.json b/scraped_kb_articles/syntax-error-when-running-vacuum-with-using-inventory-command.json
new file mode 100644
index 0000000000000000000000000000000000000000..d2fb6861878dd9d0a419d5d04ce1e7c11091eaaf
--- /dev/null
+++ b/scraped_kb_articles/syntax-error-when-running-vacuum-with-using-inventory-command.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/syntax-error-when-running-vacuum-with-using-inventory-command",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are attempting to run Delta\nVACUUM\nwith the\nUSING INVENTORY\noperation according to the documentation,\nEfficient Delta Vacuum with File Inventory\n.  You use an SQL command such as the following, and get a syntax error.\nSQL command\nVACUUM . using inventory (select 's3://'||bucket||'/'||key as path, length, isDir, modificationTime\r\nfrom inventory.datalake_report\r\nwhere bucket = ''\r\nand table = '.'\r\n)\r\nRETAIN 24 HOURS\nError\n[PARSE_SYNTAX_ERROR] Syntax error at or near 'VACUUM'.\nCause\nYour cluster is using a Databricks Runtime version below 15.2. Vacuum inventory support was released as part of Databricks Runtime starting with 15.2.\nSolution\nUpgrade your cluster's Databricks Runtime to 15.2 or above to be able to run\nVACUUM\nwith the\nUSING INVENTORY\nSQL command."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/table-create-security-exception.json b/scraped_kb_articles/table-create-security-exception.json
new file mode 100644
index 0000000000000000000000000000000000000000..a75d16fe8d04c3b44ad67fb3e3148daaaa362b2f
--- /dev/null
+++ b/scraped_kb_articles/table-create-security-exception.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/security/table-create-security-exception",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou attempt to create a table using a cluster that has Table ACLs enabled, but the following error occurs:\nError in SQL statement: SecurityException: User does not have permission SELECT on any file.\nCause\nThis error occurs on a Table ACL-enabled cluster if you are not an administrator and you do not have sufficient privileges to create a table.\nAWS\nFor example, in your notebook you attempt to create a table using a Parquet data source located on S3:\n%sql\r\n\r\nCREATE TABLE mytable\r\n  USING PARQUET\r\n  OPTIONS (PATH='s3://my-root-bucket/subfolder/my-table')\nAzure\nFor example, in your notebook you attempt to create a table using a Parquet data source located on Azure Blob Storage:\n%sql\r\n\r\nCREATE TABLE mytable\r\n  USING PARQUET\r\n  OPTIONS (PATH='wasbs://my-container@my-storage-account.blob.core.windows.net/my-table')\nSolution\nYou should ask your administrator to grant you access to the blob storage filesystem, using either of the following options. 
If an administrator cannot grant you access to the data object, you’ll have to ask an administrator to make the table for you.\nIf you want to use a\nCTAS (CREATE TABLE AS SELECT)\nstatement to create the table, the administrator should grant you\nSELECT\nprivileges on the filesystem:\n%sql\r\n\r\nGRANT SELECT ON ANY FILE TO `user1`\nExample\nCTAS\nstatement:\nAWS\n%sql\r\nCREATE TABLE mytable\r\n      AS SELECT * FROM parquet.`s3://my-root-bucket/subfolder/my-table`\nAzure\n%sql\r\n\r\nCREATE TABLE mytable\r\n      AS SELECT * FROM parquet.`wasbs://my-container@my-storage-account.blob.core.windows.net/my-table`\nIf you want to use a\nCTOP (CREATE TABLE OPTIONS PATH)\nstatement to make the table, the administrator must elevate your privileges by granting\nMODIFY\nin addition to\nSELECT\n.\n%sql\r\n\r\nGRANT SELECT, MODIFY ON ANY FILE TO `user1`\nExample\nCTOP\nstatement:\nAWS\n%sql\r\n\r\nCREATE TABLE mytable\r\n   USING PARQUET\r\n   OPTIONS (PATH='s3://my-root-bucket/subfolder/my-table')\nAzure\n%sql\r\n\r\nCREATE TABLE mytable\r\n   USING PARQUET\r\n   OPTIONS (PATH='wasbs://my-container@my-storage-account.blob.core.windows.net/my-table')\nWarning\nIt is important to understand the security implications of granting ANY FILE permissions on a filesystem. You should only grant ANY FILE to privileged users. Users with lower privileges on the cluster should never access data by referencing an actual storage location. Instead, they should access data from tables that are created by privileged users, thus ensuring that Table ACLs are enforced.\nIn addition, if files in the Databricks root and data buckets are accessible by the cluster and users have MODIFY privileges, the admin should lock down the root.\nAWS\nGranting the data access privileges described above does not supersede any underlying IAM roles or S3 bucket policies. 
For example, if a grant statement like\nGRANT SELECT, MODIFY ON ANY FILE TO user1\nis executed but an IAM role attached to the cluster explicitly denies reads to the target S3 bucket, then the\nGRANT\nstatement will not make the bucket or the objects within the bucket suddenly readable.\nDelete\nAzure\nGranting the data access privileges described above does not supersede any underlying user permissions or Blob Storage container access control. For example, if a grant statement like\nGRANT SELECT, MODIFY ON ANY FILE TO user1\nis executed but a user permission attached to the cluster explicitly denies reads to the target container, then the\nGRANT\nstatement will not make the container or the objects within the container suddenly readable.\nDelete"
+}
\ No newline at end of file
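The two privilege levels above (SELECT for CTAS, SELECT plus MODIFY for CTOP) can be sketched as a tiny Python helper. This is hypothetical illustration code, not part of any Databricks API; it only produces the GRANT statement an admin would run for each create style.

```python
def grant_statement(user: str, mode: str) -> str:
    """Return the GRANT an admin runs for a given create style.

    "ctas" (CREATE TABLE AS SELECT) needs SELECT only;
    "ctop" (CREATE TABLE OPTIONS PATH) needs SELECT and MODIFY.
    """
    privileges = {"ctas": "SELECT", "ctop": "SELECT, MODIFY"}
    if mode not in privileges:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"GRANT {privileges[mode]} ON ANY FILE TO `{user}`"

print(grant_statement("user1", "ctas"))  # GRANT SELECT ON ANY FILE TO `user1`
```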
diff --git a/scraped_kb_articles/table-not-available-while-creating-automl-experiment-model-.json b/scraped_kb_articles/table-not-available-while-creating-automl-experiment-model-.json
new file mode 100644
index 0000000000000000000000000000000000000000..393376f16973d9c44cd440eaea19559cff2d7071
--- /dev/null
+++ b/scraped_kb_articles/table-not-available-while-creating-automl-experiment-model-.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/table-not-available-while-creating-automl-experiment-model-",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYour Hive metastore tables are not visible when selecting the\nInput training dataset\nin AutoML via the user interface (UI).\n(The navigation path:\nWorkspace\n→\nExperiments\n→\nCreate AutoML Experiment\n→\nExperiment Configuration\n→\nInput training dataset\n)\nCause\nClusters with Hive metastore tables have a 2MB limit. If the schema contains thousands of tables, the AutoML UI cannot load – and therefore show – all the tables stored in the schema.\nSolution\nMigrate to Unified Compute (UC), which does not have the same 2 MB constraint.\nNote\nGenerally, Databricks recommends using Unity Catalog instead of Hive metastore. Unity Catalog has enhanced metadata management and governance capabilities. For more information, please review the\nUpgrade Hive tables & views to Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIn the meantime, you can use AutoML with the Python API, which allows you to bypass the UI limitations and specify the desired tables directly in the code.\nTo follow the steps to execute, please review the\nTrain ML models with Databricks AutoML Python API\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
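As a sketch of the API route above: the table name is passed explicitly, so the UI's 2 MB listing limit never comes into play. `databricks.automl` only exists on a Databricks ML runtime, so only the call arguments are built here; the table name and target column below are invented placeholders.

```python
# Arguments for a hypothetical AutoML Python API call; on an ML runtime this
# would be roughly: from databricks import automl; automl.classify(**automl_args)
automl_args = {
    "dataset": "hive_metastore.my_schema.my_table",  # placeholder table name
    "target_col": "label",                           # placeholder target column
    "timeout_minutes": 30,
}
print(automl_args["dataset"])
```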
diff --git a/scraped_kb_articles/table-or-view-not-found-error-when-trying-to-query-a-federated-table-using-sql-serverless-compute.json b/scraped_kb_articles/table-or-view-not-found-error-when-trying-to-query-a-federated-table-using-sql-serverless-compute.json
new file mode 100644
index 0000000000000000000000000000000000000000..fc59df76baf42e6225d01bd0274d2acd41e2f12b
--- /dev/null
+++ b/scraped_kb_articles/table-or-view-not-found-error-when-trying-to-query-a-federated-table-using-sql-serverless-compute.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/data-sources/table-or-view-not-found-error-when-trying-to-query-a-federated-table-using-sql-serverless-compute",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nWhen trying to query a federated table using SQL serverless compute, you receive an error message.\ncom.databricks.backend.common.rpc.DatabricksDatabaseException: org.apache.spark.sql.AnalysisException: Table or view not found: Schema.Table;\r\nCaused by: org.apache.spark.SparkException: com.microsoft.sqlserver.jdbc.SQLServerException: Reason: An instance-specific error occurred while establishing a connection to SQL Server. Connection was denied since Deny Public Network Access is set to Yes (https://docs.microsoft.com/azure/azure-sql/database/connectivity-settings#deny-public-network-access). To connect to this server, use the Private Endpoint from inside your virtual network (https://docs.microsoft.com/azure/sql-database/sql-database-private-endpoint-overview#how-to-set-up-private-link-for-azure-sql-database). ClientConnectionId:XXXXXX\nCause\nYou have an NCC private endpoint with public network access disabled. When public network access is disabled, the serverless control plane is not allowed to access the SQL server. Only connections from private endpoints are allowed.\nSolution\nExplicitly set up private connectivity for serverless.\nFirst, validate connection status using serverless to your SQL server. Run the following command in a notebook.\n%sh\r\nnc -vz DB_NAME-prod.mysql.database.azure.com:3306\nIf the serverless connection is not allowed, the following message is returned.\nConnectException: Connection timed out (Connection timed out)\nThen, set up NCC connectivity for the serverless environment using the Microsoft\nConfigure private connectivity from serverless compute\ndocumentation.\nFor more information, refer to the Microsoft\nEnable Azure Private Link as a simplified deployment\ndocumentation. For more information particularly about denying public network access, refer to the Microsoft\nConnectivity settings for Azure SQL Database and Azure Synapse Analytics\ndocumentation."
+}
\ No newline at end of file
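The `nc -vz` probe above can be reproduced from Python when a shell cell is not convenient. This is a generic TCP reachability check, a sketch rather than anything Databricks-specific:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Equivalent of `nc -vz host port`: True if a TCP connect succeeds
    within the timeout, False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `False` here corresponds to the `ConnectException: Connection timed out` message the article describes.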
diff --git a/scraped_kb_articles/table-writes-failing-when-trying-to-read-from-a-delta-table.json b/scraped_kb_articles/table-writes-failing-when-trying-to-read-from-a-delta-table.json
new file mode 100644
index 0000000000000000000000000000000000000000..b23c0f8910d7d28198418461a4af7af65da68e0a
--- /dev/null
+++ b/scraped_kb_articles/table-writes-failing-when-trying-to-read-from-a-delta-table.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/python/table-writes-failing-when-trying-to-read-from-a-delta-table",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYour table writes are failing when attempting to read from a Delta table and returning the following error.\nCaused by: java.lang.RuntimeException: Corrupted parquet page (14)\r\n…\nCause\nThis error often occurs when an external system overwrites files that are managed by Delta, even though the message says\nCorrupted parquet page\n.\nThe sequence of events looks like the following example. The timestamps are to show the order of events and that the corrupt page error refers to the initial Parquet file ingestion at the beginning of the sequence.\n12:39 AM: Parquet file ingested into Databricks (not written using Databricks).\n1:00 AM: Reads fail for an unknown reason.\n6:09 AM: File is rewritten using a Databricks writer but not Delta.\n6:30 AM: File read throws the corrupt page error. This file reads the file with a modification time of 12:39 AM.\nSolution\nUpgrade to Databricks Runtime 13.3 LTS or above. Databricks Runtime 13.3 LTS improves file consistency checks and produces a clearer error message which you can then action.\nCaused by: com.databricks.common.filesystem.InconsistentReadException: The file might have been updated during query execution. Ensure that no pipeline updates existing files during query execution and try again.\nTo temporarily mitigate the issue, you can read the files with the OSS vectorized reader. To use the vectorized reader, set the following Apache Spark confs for your cluster.\nspark.sql.parquet.enableVectorizedReader true\r\nspark.databricks.io.parquet.fastreader.enabled false\r\nspark.databricks.io.parquet.nativeReader.enabled false\nFor details on how to apply Spark configs, refer to the “Spark configuration” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
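The three workaround confs above, collected into a dict so they can be applied in one loop with `spark.conf.set` in a notebook (the loop is commented out because it needs a live SparkSession):

```python
# Spark confs from the article that switch reads to the OSS vectorized reader.
fallback_confs = {
    "spark.sql.parquet.enableVectorizedReader": "true",
    "spark.databricks.io.parquet.fastreader.enabled": "false",
    "spark.databricks.io.parquet.nativeReader.enabled": "false",
}

# In a notebook with a live session:
# for key, value in fallback_confs.items():
#     spark.conf.set(key, value)
```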
diff --git a/scraped_kb_articles/table_online_vector_index_replica-does-not-support-lakehouse-federation.json b/scraped_kb_articles/table_online_vector_index_replica-does-not-support-lakehouse-federation.json
new file mode 100644
index 0000000000000000000000000000000000000000..0987882cc2094ef3ebc69ad93d40f5c92e380f42
--- /dev/null
+++ b/scraped_kb_articles/table_online_vector_index_replica-does-not-support-lakehouse-federation.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/table_online_vector_index_replica-does-not-support-lakehouse-federation",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nWhen attempting to query a Vector Search index as a table, you receive the following error message.\n[RequestId=918d8ca7-8c5b-4956-8f9a-012001365310 ErrorClass=INVALID_PARAMETER_VALUE.SECURABLE_KIND_DOES_NOT_SUPPORT_LAKEHOUSE_FEDERATION] Securable with kind TABLE_ONLINE_VECTOR_INDEX_REPLICA does not support Lakehouse Federation.\nCause\nUsing SQL to query a vector index is not supported in Databricks.\nSolution\nUse the Python SDK or the REST API for querying a vector search endpoint.\nAlternatively, employ the\nvector_search\nfunction directly within SQL to query a vector search index. The\nvector_search\nfunction follows the syntax\nvector_search(index, query, num_results)\n.\nExample\nSELECT * FROM VECTOR_SEARCH(index => \"main.db.my_index\", query => \"iphone\", num_results => 2)\nFor more information, please review the\nHow to create and query a vector search index\nand\nvector_search function\ndocumentation.\nFor details on the vector search Python SDK, please review the\ndatabricks.vector_search package\ndocumentation."
+}
\ No newline at end of file
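A small helper (hypothetical, for illustration only) that assembles the `vector_search` SQL shown above from its three arguments. It assumes the query text contains no double quotes; proper SQL quoting is out of scope for this sketch.

```python
def vector_search_sql(index: str, query: str, num_results: int) -> str:
    """Build the SELECT shown in the article for a given index and query.

    Assumes `query` contains no double quotes (naive interpolation)."""
    return (
        f'SELECT * FROM VECTOR_SEARCH(index => "{index}", '
        f'query => "{query}", num_results => {num_results})'
    )

print(vector_search_sql("main.db.my_index", "iphone", 2))
```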
diff --git a/scraped_kb_articles/tackling-schema-issues-that-arise-for-ml-models-trained-outside-of-databricks.json b/scraped_kb_articles/tackling-schema-issues-that-arise-for-ml-models-trained-outside-of-databricks.json
new file mode 100644
index 0000000000000000000000000000000000000000..32922b54661c77a795ddfc8b08078586b1a12107
--- /dev/null
+++ b/scraped_kb_articles/tackling-schema-issues-that-arise-for-ml-models-trained-outside-of-databricks.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/tackling-schema-issues-that-arise-for-ml-models-trained-outside-of-databricks",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nWhen you are attempting to register a machine learning model from Hugging Face that was trained outside of Databricks, the model fails with the error message\nFailed to infer Schema\n:\n`MLFlowException: Failed to infer Schema. Expected one of the following types:\r\n-pandas.DataFrame\r\n-pandas.Series…\r\nFile /databricks/python/lib/python3.11/site-packages/mlflow/types/utils.py:374 in infer_schema(data)...`\nCause\nDatabricks expects model artifacts to follow a specific structure. When calling\nmlflow..log_model\n, MLflow arranges the model's artifacts properly for correct loading. If you attempt to register a model trained outside of Databricks or try to fine-tune it with additional data in a Databricks notebook, this may lead to a\nFailed to infer Schema\nerror due to the artifact structure not aligning with Databricks' expectations for Hugging Face models.\nSolution\nThis issue arises in Databricks environments when working with machine learning models, particularly those trained outside of Databricks. To address the fact that the model artifacts are in the structure expected by Databricks, the entire code from the external environment should be used to retrain the model within Databricks before fine-tuning.\nTo resolve this issue, follow these steps:\nConfigure the MLflow tracking server. Set up the MLflow tracking server to register the model in the code used to train it outside of Databricks.\nModify and reorder the artifact's folder in such a way that it matches the structure of a Hugging Face model.\nUse MLflow logging. 
Use\nmlflow..log_model\nto log the model, which automatically handles the artifact structure.\nThe typical structure of a Hugging Face model includes:\nconfig.json\n: Contains the model configuration\npytorch_model.bin\n: The model weights\ntokenizer.json\nor other tokenizer files: For text processing\nREADME.md\n: A model card describing the model's purpose and usage\nFor more details on model structure and creating custom models compatible with the Hugging Face ecosystem, review the\nCreate a custom architecture\ndocumentation.\nBest practices\nDatabricks recommends the following best practices while creating the models:\nEnsure proper artifact structure. Before registering the model within a Databricks notebook, verify that the artifact structure aligns with Databricks' expectations.\nUnderstand model signatures. Familiarise yourself with\ninfer_signature\nor\nModelSignature\nmethods to properly define input and output schemas for your models\nEnsure you are familiar with the\nMLflow Python API\ndocumentation.\nFor more information, review the\nTrack model development using MLflow\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
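The structure list above can be turned into a quick pre-flight check run before registering: given a local artifact folder, report which of the fixed-name Hugging Face files are missing. Tokenizer file names vary by model, so this sketch only checks the two fixed names.

```python
from pathlib import Path

# Fixed-name files the article lists for a typical Hugging Face model.
EXPECTED_FILES = {"config.json", "pytorch_model.bin"}

def missing_artifacts(model_dir: str) -> set:
    """Return the expected Hugging Face files absent from model_dir."""
    root = Path(model_dir)
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return EXPECTED_FILES - present
```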
diff --git a/scraped_kb_articles/tag-removal-is-not-propagated-on-compute-restart.json b/scraped_kb_articles/tag-removal-is-not-propagated-on-compute-restart.json
new file mode 100644
index 0000000000000000000000000000000000000000..591a6b80362eda40b00b10b52cc1661a61019c9d
--- /dev/null
+++ b/scraped_kb_articles/tag-removal-is-not-propagated-on-compute-restart.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/clusters/tag-removal-is-not-propagated-on-compute-restart",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nThe tags you remove from your Azure cloud resource do not propagate correctly to your workspace compute resources. The tag is still present in the workspace compute even after a restart.\nCause\nThere is a synchronization failure between the cloud resource and the workspace compute.\nSolution\nManually edit your compute resource. Perform a minor, temporary change such as adding a dummy tag or a small change on the cluster name.\nNavigate to\nCompute\nin the sidebar.\nSelect the impacted cluster.\nClick\nEdit\n.\nTo make a tag change, scroll down to the\nTags\nsection.\nAdd a temporary tag or change the value of an existing one and then change it back.\nClick Confirm at the bottom of the page.\nAlternatively, to make a name change, in step 4 instead of scrolling to the\nTags\nsection, click in the\nCompute Name\nfield and make a minor change, such as an extra space at the end. Continue with step 6.\nThe compute-side change forces the tags to synchronize with the cloud resource and reflect the correct tags."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/tag-update-failure-on-serving-endpoint.json b/scraped_kb_articles/tag-update-failure-on-serving-endpoint.json
new file mode 100644
index 0000000000000000000000000000000000000000..d78966846ceeb771e4527286bfa2d4bf6887ca2c
--- /dev/null
+++ b/scraped_kb_articles/tag-update-failure-on-serving-endpoint.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/tag-update-failure-on-serving-endpoint",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYou are attempting to programmatically update tags for a serving endpoint using the Python SDK on any cluster. When you use the\nw.serving_endpoints.patch()\ncall you receive the following error.\nAttributeError: 'str' object has no attribute 'get'\nCause\nThe default\ndatabricks-sdk\nversion (0.20.0) pre-installed on compute lacks complete implementation of the\npatch()\nmethod for serving endpoints. While the method is present, it does not correctly handle the API response format returned, leading to failures when attempting to update tags.\nSolution\nUpgrade the\ndatabricks-sdk\nto version 0.53.0 or above.\n!pip install --upgrade databricks-sdk\r\ndbutils.library.restartPython()\nAfter upgrading, use the\nEndpointTag\nclass to define tags and pass them to the\nadd_tags\nparameter in the\npatch()\nmethod of the\nserving_endpoints\nclient.\nfrom databricks.sdk import WorkspaceClient\r\nfrom databricks.sdk.service.serving import EndpointTag\r\n\r\nw = WorkspaceClient()\r\ntags_to_add = [\r\n    EndpointTag(key=\"\", value=\"\"),\r\n    EndpointTag(key=\"\", value=\"\"),\r\n]\r\n\r\nupdated_tags = w.serving_endpoints.patch(\r\n    name=\"\",\r\n    add_tags=tags_to_add,\r\n)\r\nprint(\"Endpoint now has these tags:\", updated_tags)\nFor more information on the\npatch()\nmethod, review the\nw.serving_endpoints: Serving endpoints\ndocumentation."
+}
\ No newline at end of file
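A pre-flight sketch of the version gate described above: compare the installed `databricks-sdk` version against the 0.53.0 minimum before deciding whether to run the pip upgrade. The helper assumes plain `X.Y.Z` version strings.

```python
def needs_upgrade(installed: str, minimum: str = "0.53.0") -> bool:
    """True if `installed` is older than `minimum` (numeric X.Y.Z compare)."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(minimum)

# The default pre-installed version from the article:
print(needs_upgrade("0.20.0"))  # True
```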
diff --git a/scraped_kb_articles/task-deserialization-time-high.json b/scraped_kb_articles/task-deserialization-time-high.json
new file mode 100644
index 0000000000000000000000000000000000000000..acc5655d5f912d080cfe02c4d69f771d5270f09a
--- /dev/null
+++ b/scraped_kb_articles/task-deserialization-time-high.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/jobs/task-deserialization-time-high",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYour tasks are running slower than expected.\nYou review the stage details in the\nSpark UI\non your cluster and see that task deserialization time is high.\nCause\nCluster-installed libraries (\nAWS\n|\nAzure\n|\nGCP\n) are only installed on the driver when the cluster is started. These libraries are only installed on the executors when the first tasks are submitted. The time taken to install the PyPI libraries is included in the task deserialization time.\nDelete\nInfo\nLibrary installation only occurs on an executor where a task is launched. If a second executor is given a task, the installation process is repeated. The more libraries you have installed, the more noticeable the delay time when a new executor is launched.\nSolution\nIf you are using a large number of PyPI libraries, you should configure your cluster to install the libraries on all the executors when the cluster is started. This results in a slight increase to the cluster launch time, but allows your job tasks to run faster because you don’t have to wait for libraries to install on the executors after the initial launch.\nAdd\nspark.databricks.libraries.enableSparkPyPI false\nto the cluster’s\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n) and restart the cluster."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/tensorflow-fails-to-import.json b/scraped_kb_articles/tensorflow-fails-to-import.json
new file mode 100644
index 0000000000000000000000000000000000000000..7e9fd7d4e3ae222c4e8212c6eacd32ae08c78ee5
--- /dev/null
+++ b/scraped_kb_articles/tensorflow-fails-to-import.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/libraries/tensorflow-fails-to-import",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYou have\nTensorFlow\ninstalled on your cluster.\nWhen you try to import\nTensorFlow\n, it fails with an\nInvalid Syntax\nor\nimport error\n.\nCause\nThe version of\nprotobuf\ninstalled on your cluster is not compatible with your version of\nTensorFlow\n.\nSolution\nUse a cluster-scoped init script to install\nTensorFlow\nwith matching versions of\nNumPy\nand\nprotobuf\n.\nCreate the init script.\n%python\r\n\r\ndbutils.fs.put(\"/databricks//install-tensorflow.sh\",\"\"\"\r\n#!/bin/bash\r\nset -e\r\n/databricks/python/bin/python -V\r\n/databricks/python/bin/pip install tensorflow protobuf==3.17.3 numpy==1.15.0\r\n\"\"\", True)\nInstall the init script that you just created as a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nYou will need the full path to the location of the script (\ndbfs:/databricks//install-tensorflow.sh\n).\nRestart the cluster after you have installed the init script.\nDelete\nInfo\nUninstall all existing versions of\nNumPy\nbefore installing the init script on your cluster."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/termination-reasons.json b/scraped_kb_articles/termination-reasons.json
new file mode 100644
index 0000000000000000000000000000000000000000..915896332fc3687d2a7af4bf59477c7f16310df9
--- /dev/null
+++ b/scraped_kb_articles/termination-reasons.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/clusters/termination-reasons",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Sometimes a cluster is terminated unexpectedly, not as a result of a\nmanual termination\nor a configured\nautomatic termination\n. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation.\nDatabricks initiated request limit exceeded\nTo defend against API abuses, ensure quality of service, and prevent you from accidentally creating too many large clusters, Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and resizing. The throttling uses the\ntoken bucket algorithm\nto limit the total number of nodes that anyone can launch over a defined interval across your Databricks deployment, while allowing burst requests of certain sizes. Requests coming from both the web UI and the APIs are subject to rate limiting. When cluster requests exceed rate limits, the limit-exceeding request fails with a\nREQUEST_LIMIT_EXCEEDED\nerror.\nSolution\nIf you hit the limit for your legitimate workflow, Databricks recommends that you do the following:\nRetry your request a few minutes later.\nSpread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all of your\njobs\nto run at an hourly boundary, try distributing them at different intervals within the hour.\nConsider using clusters with a larger\nnode type\nand smaller number of nodes.\nUse\nautoscaling\nclusters.\nIf these options don’t work for you, contact Databricks Support to request a limit increase for the core instance.\nFor other Databricks initiated termination reasons, see\nTermination Code\n.\nCloud provider initiated terminations\nThis article lists common cloud provider related termination reasons and remediation steps.\nAWS\nProvider limit\nDatabricks launches a cluster by requesting resources on behalf of your cloud account. 
Sometimes, these requests fail because they would exceed your cloud account’s resource limits. In AWS, common error codes include:\nInstanceLimitExceeded\nAWS limits the number of running instances for each node type. Possible solutions include:\nRequest a cluster with fewer nodes.\nRequest a cluster with a different node type.\nAsk AWS support to\nincrease instance limits\n.\nClient.VolumeLimitExceeded\nThe cluster creation request exceeded the\nEBS volume\nlimit. AWS has two types of volume limits: a limit on the total number of EBS volumes, and a limit on the total storage size of EBS volumes. Potential remediation steps:\nRequest a cluster with fewer nodes.\nCheck which of the two limits was exceeded. (\nAWS trusted advisor\nshows service limits for free). If the request exceeded the total number of EBS volumes, try reducing the requested number of volumes per node. If the request exceeded the total EBS storage size, try reducing the requested storage size and/or the number of EBS volumes.\nAsk AWS support to\nincrease EBS volume limits\n.\nRequestLimitExceeded\nAWS\nlimits the rate of API requests\nmade for an AWS account. Wait a while before retrying the request.\nProvider shutdown\nThe Spark driver is a single point of failure because it holds all cluster state. If the instance hosting the driver node is shut down, Databricks terminates the cluster. In AWS, common error codes include:\nClient.UserInitiatedShutdown\nInstance was terminated by a direct request to AWS which did not originate from Databricks. Contact your AWS administrator for more details.\nServer.InsufficientInstanceCapacity\nAWS could not satisfy the instance request. Wait a while and retry the request. Contact AWS support if the problem persists.\nServer.SpotInstanceTermination\nInstance was terminated by AWS because the current spot price has exceeded the maximum bid made for this instance. 
Use an on-demand instance for the driver, choose a different availability zone, or specify a higher spot bid price.\nFor other shutdown-related error codes, refer to\nAWS docs\n.\nDelete\nLaunch failure\nAWS\nIn AWS, common error codes include:\nUnauthorizedOperation\nDatabricks was not authorized to launch the requested instances. Possible reasons include:\nYour AWS administrator invalidated the AWS access key or IAM role used to launch instances.\nYou are trying to launch a cluster using an IAM role that Databricks does not have permission to use. Contact the AWS administrator who set up the IAM role. For more information, see\nSecure Access to S3 Buckets Using IAM Roles\n.\nUnsupported with message “EBS-optimized instances are not supported for your requested configuration”\nThe selected instance type is not available in the selected availability zone (AZ). It does not actually have anything to do with EBS-optimization being enabled. To remediate, you can choose a different instance type or AZ.\nAuthFailure.ServiceLinkedRoleCreationNotPermitted\nThe provided credentials do not have permission to create the service-linked role for EC2 spot instances. The Databricks administrator needs to update the credentials used to launch instances in your account. Instructions and the updated policy can be found in the\nAWS Account\ndocumentation.\nSee\nError Codes\nfor a complete list of AWS error codes.\nDelete\nAzure\nThis termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and message from the API are propagated to help you troubleshoot the issue.\nOperationNotAllowed\nYou have reached a quota limit, usually number of cores, that your subscription can launch. Request a limit increase in Azure portal. See\nAzure subscription and service limits, quotas, and constraints\n.\nPublicIPCountLimitReached\nYou have reached the limit of the public IPs that you can have running. 
Request a limit increase in Azure Portal.\nSkuNotAvailable\nThe resource SKU you have selected (such as VM size) is not available for the location you have selected. To resolve, see\nResolve errors for SKU not available\n.\nReadOnlyDisabledSubscription\nYour subscription was disabled. Follow the steps in\nWhy is my Azure subscription disabled and how do I reactivate it?\nto reactivate your subscription.\nResourceGroupBeingDeleted\nCan occur if someone deletes your Azure Databricks workspace in the Azure portal and you try to create a cluster at the same time. The cluster fails because the resource group is being deleted.\nSubscriptionRequestsThrottled\nYour subscription is hitting the Azure Resource Manager request limit (see\nThrottling Resource Manager requests\n). The typical cause is another system (outside Azure Databricks) making a lot of API calls to Azure. Contact Azure support to identify this system and then reduce the number of API calls.\nDelete\nCommunication lost\nDatabricks was able to launch the cluster, but lost the connection to the instance hosting the Spark driver.\nAWS\nCaused by an incorrect networking configuration (for example, changing security group settings for Databricks workers) or a transient AWS networking issue.\nDelete\nAzure\nCaused by the driver virtual machine going down or a networking issue.\nDelete"
+}
\ No newline at end of file
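The token bucket algorithm named above can be sketched in a few lines. This is a generic illustration of the throttling idea, not Databricks' actual limiter or its parameters: a request for n nodes succeeds only while the bucket holds n tokens; tokens refill at a steady rate, so bounded bursts pass and sustained overload fails.

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity      # max burst size, in tokens (nodes)
        self.rate = refill_per_sec    # steady-state allowance per second
        self.tokens = capacity        # start with a full bucket
        self.last = time.monotonic()

    def try_acquire(self, n: float) -> bool:
        """Take n tokens if available; a False result is the analogue of
        the REQUEST_LIMIT_EXCEEDED failure (caller should retry later)."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

With a refill rate of zero the behaviour is deterministic: a burst within capacity succeeds, the next over-capacity request fails.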
diff --git a/scraped_kb_articles/terraform-registry-does-not-have-a-provider-error.json b/scraped_kb_articles/terraform-registry-does-not-have-a-provider-error.json
new file mode 100644
index 0000000000000000000000000000000000000000..d062df6ce5edfc7790343d3b4337d5168d75278a
--- /dev/null
+++ b/scraped_kb_articles/terraform-registry-does-not-have-a-provider-error.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/terraform/terraform-registry-does-not-have-a-provider-error",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nYou are installing the Databricks Terraform provider (\nAWS\n|\nAzure\n|\nGCP\n) and get a Databricks\nprovider registry\nerror.\nError while installing hashicorp/databricks: provider registry\r\nregistry.terraform.io does not have a provider named\r\nregistry.terraform.io/hashicorp/databricks\nCause\nThis error occurs when the\nrequired_providers\nblock is not defined in every module that uses the Databricks Terraform provider.\nSolution\nCreate a\nversions.tf\nfile with the following contents:\n# versions.tf\r\nterraform {\r\n  required_providers {\r\n    databricks = {\r\n      source  = \"databricks/databricks\"\r\n      version = \"1.0.0\"\r\n    }\r\n  }\r\n}\nSave a copy of this\nversion.tf\nfile in every module in the\nenvironments\nlevel of your code base.\nRemove the\nversion\nfield from the\nversions.tf\nfile and save a copy of the updated file in every module in the\nmodules\nlevel of your code base.\nFor example:\n├── environments\r\n│   ├── sandbox\r\n│   │   ├── README.md\r\n│   │   ├── main.tf\r\n│   │   └── versions.tf   // This file contains the \"version\" field.\r\n│   └── production\r\n│       ├── README.md\r\n│       ├── main.tf\r\n│       └── versions.tf   // This file contains the \"version\" field.\r\n└── modules\r\n    ├── first-module\r\n    │   ├── ...\r\n    │   └── versions.tf   // This file does NOT contain the \"version\" field.\r\n    └── second-module\r\n        ├── ...\r\n        └── versions.tf   // This file does NOT contain the \"version\" field.\nReview the\nRequiring providers\nTerraform documentation for more information."
+}
\ No newline at end of file
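The two variants of `versions.tf` described above differ only in the `version` field. A throwaway generator (hypothetical tooling, not part of Terraform) makes the rule concrete: pass `include_version=True` for `environments`-level modules and `False` for `modules`-level modules.

```python
def versions_tf(include_version: bool, version: str = "1.0.0") -> str:
    """Render versions.tf; the version pin is omitted at the modules level."""
    version_line = f'      version = "{version}"\n' if include_version else ""
    return (
        "terraform {\n"
        "  required_providers {\n"
        "    databricks = {\n"
        '      source  = "databricks/databricks"\n'
        + version_line +
        "    }\n"
        "  }\n"
        "}\n"
    )
```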
diff --git a/scraped_kb_articles/the-dbt-option-is-missing-from-menu-options-during-job-creation.json b/scraped_kb_articles/the-dbt-option-is-missing-from-menu-options-during-job-creation.json
new file mode 100644
index 0000000000000000000000000000000000000000..f9bad7f26fbd9a34901550b8288c40be415b1ec6
--- /dev/null
+++ b/scraped_kb_articles/the-dbt-option-is-missing-from-menu-options-during-job-creation.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/jobs/the-dbt-option-is-missing-from-menu-options-during-job-creation",
+    "title": "Título do Artigo Desconhecido",
+    "content": "Problem\nWhen you navigate to\nJobs & Pipelines\nto create a new job, you notice the dbt (data build tool) is missing in the\nType\nfield menu options.\nAlso, when you attempt to use dbt through the REST API, you receive an error.\n\"error_code\": \"FEATURE_DISABLED\",\r\n\"message\": \"DBT task feature is not enabled.\"\nCause\ndbt relies on Git repository integrations to manage workflows. Without these integrations, task-related operations tied to configuration or dependencies are stopped.\nYour Git repository integration is either:\nDisabled due to intentional admin-level configurations.\nMissing the enablement step done during workspace creation.\nSolution\nFirst, confirm your workspace is either the premium or enterprise tier. dbt tasks are only available for these tiers.\nIf you are on the premium or enterprise tier and still don’t see this option, have an administrator enable the Git folder feature using the following Databricks REST API command. They can refer to the\nEnable or disable the Databricks Git folder feature\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\ncurl -X PATCH \\\\\r\n  -H \"Authorization: Bearer \" \\\\\r\n  -H \"Content-Type: application/json\" \\\\\r\n  -d '{\r\n         \"enableProjectTypeInWorkspace\": true\r\n      }' \\\\\r\n  https:///api/2.0/workspace-conf\nAlternatively, the workspace admin can integrate using SDKs or enable the Git folder feature by running a pre-built Databricks notebook\nturn-on-repos-refresh\n.\nThen, check that you can see the Git folder feature in your workspace. From the main view, navigate to\nAdmin Console\n>\nWorkspace admin\n>\nDevelopment\nand confirm Git folder is enabled under\nRepos\n.\nLast, check to ensure the dbt task is now available during job creation."
+}
\ No newline at end of file
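The curl command above, rebuilt as a request description in Python. Nothing is sent here; the workspace URL and token are caller-supplied placeholders (the article's own command has them elided), and only the payload shape is demonstrated.

```python
import json

def enable_git_folder_request(workspace_url: str, token: str) -> dict:
    """Describe (not send) the PATCH to /api/2.0/workspace-conf
    that enables the Git folder feature, mirroring the curl example."""
    return {
        "method": "PATCH",
        "url": f"{workspace_url}/api/2.0/workspace-conf",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"enableProjectTypeInWorkspace": True}),
    }
```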
diff --git a/scraped_kb_articles/the-deltaretentiondurationcheck-property-is-not-recognized-when-using-serverless-compute.json b/scraped_kb_articles/the-deltaretentiondurationcheck-property-is-not-recognized-when-using-serverless-compute.json
new file mode 100644
index 0000000000000000000000000000000000000000..647b0b390c9644b89b40af27585a14dfbdf77abd
--- /dev/null
+++ b/scraped_kb_articles/the-deltaretentiondurationcheck-property-is-not-recognized-when-using-serverless-compute.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/the-deltaretentiondurationcheck-property-is-not-recognized-when-using-serverless-compute",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are trying to migrate to serverless compute but you are encountering an issue with the Apache Spark\ndelta.retentionDurationCheck\nproperty not working correctly.\nFor example, this sample code snippet does not work when you are using serverless compute:\nspark.sql(\"SET spark.databricks.delta.retentionDurationCheck.enabled=false\")\r\nspark.sql(\"VACUUM .. RETAIN 24 HOURS\")\nYou want to maintain the retention rules that you have defined in your notebook and use serverless compute.\nCause\nDatabricks serverless compute does not support certain Spark properties, including\nspark.databricks.delta.retentionDurationCheck.enabled\n.\nServerless architecture is designed to optimize resource usage and scalability, but it also restricts certain configurations that are available with standard compute.\nUsers cannot directly set these properties in serverless environments, which can be a challenge when maintaining specific retention rules for Delta tables.\nSolution\nUse VACUUM instead of disabling the retention check\nFor example, if you want to use the recommended retention period of 7 days, use the following command in your notebook:\nspark.sql(\"VACUUM .. RETAIN 168 HOURS\")  # 168 hours = 7 days\nFor more information, review the\nRemove unused data files with vacuum\n(\nAWS\n|\nAzure\n) documentation.\nUse table properties to set a specific retention period, especially if you require a short duration\nUse the following command in your notebook:\nALTER TABLE .. SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 24 hours')\nFor a list of supported Spark configuration parameters in serverless clusters, review the\nServerless compute release notes\n(\nAWS\n|\nAzure\n).\nYou can effectively manage retention rules in serverless compute without relying on an unsupported Spark configuration."
+}
\ No newline at end of file
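The ALTER TABLE workaround above can be parameterized. This is a hypothetical sketch (the helper and table names are invented, not a Databricks API) showing how the interval literal for `delta.deletedFileRetentionDuration` is assembled:

```python
# Hypothetical helper: builds the ALTER TABLE statement that pins a Delta
# table's deleted-file retention, for serverless compute where
# delta.retentionDurationCheck cannot be disabled.
def retention_ddl(table: str, hours: int) -> str:
    # Delta expects an interval literal such as 'interval 24 hours'.
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.deletedFileRetentionDuration'='interval {hours} hours')"
    )

# 24 hours mirrors the short retention used in the article's VACUUM example.
print(retention_ddl("catalog.schema.events", 24))
```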
diff --git a/scraped_kb_articles/the-sql-statement-to-create-a-table-in-delta-live-tables-dlt-is-ignored-without-errors.json b/scraped_kb_articles/the-sql-statement-to-create-a-table-in-delta-live-tables-dlt-is-ignored-without-errors.json
new file mode 100644
index 0000000000000000000000000000000000000000..bcca460682414103c166c965d07cd3750363d8c3
--- /dev/null
+++ b/scraped_kb_articles/the-sql-statement-to-create-a-table-in-delta-live-tables-dlt-is-ignored-without-errors.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta-live-tables/the-sql-statement-to-create-a-table-in-delta-live-tables-dlt-is-ignored-without-errors",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou’re attempting to add a materialized view to an existing Delta Live Table (DLT) pipeline written in Python. You add an SQL statement to the notebook to create a new table and notice the SQL statement is ignored.\nCause\nDLT notebooks can only contain either Python code or SQL statements, but not both. When a notebook contains both, the SQL statement is ignored.\nSolution\nKeep your existing DLT pipeline for your Python notebook, and set up a separate pipeline and notebook for SQL statements.\nCreate a new DLT pipeline with your SQL notebook.\nMove the SQL statement to the new notebook.\nRun the new pipeline to create the materialized view.\nWhen you’re done, verify that the new table is created in the expected location and your materialized view is updated correctly."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/the-written-metric-in-delta-live-tables-does-not-match-the-number-of-rows-in-the-target-table.json b/scraped_kb_articles/the-written-metric-in-delta-live-tables-does-not-match-the-number-of-rows-in-the-target-table.json
new file mode 100644
index 0000000000000000000000000000000000000000..24c012926464759114e772beb29c0990c726eb9a
--- /dev/null
+++ b/scraped_kb_articles/the-written-metric-in-delta-live-tables-does-not-match-the-number-of-rows-in-the-target-table.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta-live-tables/the-written-metric-in-delta-live-tables-does-not-match-the-number-of-rows-in-the-target-table",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are streaming from Delta Live Tables when you notice that the\nWritten\nmetric in the\nData Quality\ntab shows a different number of records than the actual number of rows in the target table. The\nData Quality\ntab can be found in the settings of the DLT pipeline used for the run.\nIf you use\nSELECT count(*)\nto query the number of rows in the target table, the result can be lower than what is displayed in the\nWritten\nmetric.\nThe issue can be observed in the Databricks workspace and the problem is not specific to any particular configuration or setting but may be more prevalent in streaming use cases.\nCause\nThe\nWritten\nmetric in the\nData Quality\ntab represents the number of records that have been processed and attempted to be written to the target table. This includes all records that have passed through the pipeline, regardless of whether they were ultimately inserted into the table.\nThe discrepancy between the\nWritten\nmetric and the actual number of rows in the table can be attributed to various factors, such as:\nDuplicate records being processed and written, but later dropped or deleted due to constraints or data quality rules defined in the DLT pipeline.\nRecords being ignored or not stored due to failing validation or other related reasons.\nThe\nWritten\nmetric counting records that were written to the table but later rolled back due to transactional failures or other issues.\nSolution\nIt is important to understand that the\nWritten\nmetric includes all records that have been processed, regardless of whether they were ultimately inserted into the table. It is not a count of the actual number of rows in the target table.\nThere are steps you can take to track the amount of data being written to a table.\nUtilize the DLT expectations feature to validate row counts across tables in the pipeline. 
This can help you get a better understanding of how much data is being written to the table at each stage in the DLT pipeline.\nReview the data quality rules and constraints defined in the DLT pipeline to ensure that they are not causing records to be dropped or ignored unnecessarily.\nMonitor transactions and rollbacks to identify records that were written to the table but later rolled back due to transactional failures or other issues.\nFor more information on the expectations feature, review the\nWhat are Delta Live Tables expectations?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
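The factors listed above can be illustrated with a toy model in plain Python (this is not Databricks or DLT code): every processed record counts toward the Written metric, while deduplication and validation shrink what actually lands in the target table:

```python
# Toy model of why Written can exceed the final row count: all processed
# records count as "written", but duplicates and invalid rows never
# survive into the target table. Field names are made up.
records = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},   # duplicate, later dropped
    {"id": 2, "value": None}, # fails a data quality rule
    {"id": 3, "value": 30},
]

written = len(records)  # what the Written metric would report

target = {}
for r in records:
    if r["value"] is not None:  # validation / expectation
        target[r["id"]] = r     # dedup on the key

print(written, len(target))  # Written is 4, the table holds 2 rows
```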
diff --git a/scraped_kb_articles/time-travel-select-query-works-on-older-dates-even-after-vacuum.json b/scraped_kb_articles/time-travel-select-query-works-on-older-dates-even-after-vacuum.json
new file mode 100644
index 0000000000000000000000000000000000000000..9b949b08532204f0236153407789f90b4c0ff6d1
--- /dev/null
+++ b/scraped_kb_articles/time-travel-select-query-works-on-older-dates-even-after-vacuum.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/time-travel-select-query-works-on-older-dates-even-after-vacuum",
+    "title": "Unknown Article Title",
+    "content": "Problem\nAfter you run\nVACUUM\nsuccessfully, you notice you can still access and query data further back than the default seven days of retained history with time travel.\nCause\nWhen you run\nVACUUM\n,  it removes stale files from the file system but it may not remove all files immediately. When there are still active files at a particular version, time travel can still access those files even though they are beyond the expected retention period.\nSolution\nThis is expected\nVACUUM\nbehavior. To be sure\nVACUUM\ndid execute when you ran the command, you can test by reading a table’s state as of\n\"\"\nand writing it to a temporary table using the following query.\nINSERT OVERWRITE  AS SELECT * FROM .. TIMESTAMP AS OF ''\nThe query should fail, because it tries to read every Parquet data file at that state, and if some files are already removed by\nVACUUM\n, it can’t read those files.\nAdditionally, you can check the\nVACUUM\nhistory to see if it has indeed deleted the stale files from storage as expected.\nFor more information, review the\nWork with Delta Lake table history\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nVACUUM\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/time-zone-conversion-is-not-visibly-applied-when-using-display-on-timezone-aware-pandas-datetime-columns.json b/scraped_kb_articles/time-zone-conversion-is-not-visibly-applied-when-using-display-on-timezone-aware-pandas-datetime-columns.json
new file mode 100644
index 0000000000000000000000000000000000000000..0930ae61bcf2a962a426911ab582868b445b6bf2
--- /dev/null
+++ b/scraped_kb_articles/time-zone-conversion-is-not-visibly-applied-when-using-display-on-timezone-aware-pandas-datetime-columns.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/machine-learning/time-zone-conversion-is-not-visibly-applied-when-using-display-on-timezone-aware-pandas-datetime-columns",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen working in a notebook, timezone-aware\ndatetime64[ns, tz]\ncolumns in a pandas DataFrame do not reflect the expected timezone after conversion when rendered using the\ndisplay()\nfunction. Although the conversion is correctly applied in memory, the displayed output remains in UTC.\nCause\nDatabricks'\ndisplay()\nfunction leverages Apache Spark’s rendering behavior, which by default uses the session time zone (usually UTC unless explicitly configured).\nSolution\nExplicitly format timezone-aware datetime columns as strings using\n.strftime('%Y-%m-%d %H:%M:%S%z')\nbefore calling\ndisplay()\n. This ensures the output includes the numeric offset.\nimport pandas as pd\r\n\r\n# Sample data\r\ndata = {\r\n    'datetime': [\r\n        '2025-05-28 08:00:00',\r\n        '2025-05-28 12:30:00',\r\n        '2025-05-28 16:45:00',\r\n        '2025-05-29 00:15:00',\r\n        '2025-05-29 04:00:00',\r\n    ]\r\n}\r\n\r\ndf = pd.DataFrame(data)\r\n\r\n# Convert to datetime, localize to base timezone (such as UTC), and convert to desired timezone\r\ndt_base = pd.to_datetime(df['datetime']).dt.tz_localize('UTC')\r\ndt_converted = dt_base.dt.tz_convert('')  # Replace with desired time zone\r\n\r\n# Format to string with offset\r\ndf['start_time_base'] = dt_base.dt.strftime('%Y-%m-%d %H:%M:%S%z')\r\ndf['start_time_converted'] = dt_converted.dt.strftime('%Y-%m-%d %H:%M:%S%z')\r\n\r\n# Display formatted output\r\ndisplay(df[['datetime', 'start_time_base', 'start_time_converted']])\nIf you’re using Spark DataFrames or SQL, and you want to consistently render all times in a specific timezone, you can optionally configure the session timezone explicitly with its TZ identifier. For example, “\nAustralia/Sydney”\n.\nFor a complete timezone database, refer to\nTime Zone Database\n.\nPython\nspark.conf.set(\"spark.sql.session.timeZone\", \"\")\nSQL\nSET TIME ZONE '';"
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/time-zones-converted-from-a-local-zone-to-utc-and-back-not-reverting-to-original-values-in-apache-spark-and-sql-warehouse.json b/scraped_kb_articles/time-zones-converted-from-a-local-zone-to-utc-and-back-not-reverting-to-original-values-in-apache-spark-and-sql-warehouse.json
new file mode 100644
index 0000000000000000000000000000000000000000..dcd58c77964ecbd080a4f3d2b2ab97fc9eac7292
--- /dev/null
+++ b/scraped_kb_articles/time-zones-converted-from-a-local-zone-to-utc-and-back-not-reverting-to-original-values-in-apache-spark-and-sql-warehouse.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/sql/time-zones-converted-from-a-local-zone-to-utc-and-back-not-reverting-to-original-values-in-apache-spark-and-sql-warehouse",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen using Apache Spark and SQL Warehouse, you encounter time zone conversion discrepancies. For example, when converting timestamps from the 'Australia/Sydney' time zone to UTC and then back to 'Australia/Sydney', the resulting timestamps may not match the original values.\nCause\nThe configuration parameter\nspark.sql.datetime.java8API.enabled\nis enabled by default in SQL Warehouse but not in interactive clusters. When not enabled, this parameter affects how timestamps are handled and converted, leading to inconsistencies.\nAdditionally, historical changes in time zone offsets, such as the shift from UTC+10:05 to UTC+10:00 in 1896, contribute to the observed discrepancies.\nSolution\nNavigate to your cluster settings.\nUnder the\nAdvanced options > Spark\ntab, enter\nspark.sql.datetime.java8API.enabled True\nin the\nSpark config\nbox.\nTest the timestamp conversion again to ensure that the issue is resolved.\nAlternatively, you can set the parameter in a notebook. Run the following code.\nspark.conf.set(\"spark.sql.datetime.java8API.enabled\", \"true\")\nIf the issue persists, verify that the configuration change has been applied correctly and that there are no other conflicting settings.\nFor further reference, consult the\nDates and timestamps\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+}
\ No newline at end of file
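The historical-offset point above can be checked in plain Python against the IANA time zone database (assuming `tzdata` is installed on the system): before offsets were standardized in the 1890s, Sydney used local mean time, which is not a whole-hour offset, so very old timestamps cannot round-trip through one:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # uses the system IANA tz database

syd = ZoneInfo("Australia/Sydney")

old = datetime(1890, 1, 1, tzinfo=syd)     # pre-standardization: local mean time
modern = datetime(2020, 1, 1, tzinfo=syd)  # AEDT in January

print(old.utcoffset())     # a non-whole-hour LMT offset
print(modern.utcoffset())  # the modern whole-hour offset
```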
diff --git a/scraped_kb_articles/timeout-error-when-integrating-kafka-with-apache-spark-structured-streaming.json b/scraped_kb_articles/timeout-error-when-integrating-kafka-with-apache-spark-structured-streaming.json
new file mode 100644
index 0000000000000000000000000000000000000000..0563457c2a99c8902fd1729d250f46e42c64ed82
--- /dev/null
+++ b/scraped_kb_articles/timeout-error-when-integrating-kafka-with-apache-spark-structured-streaming.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/streaming/timeout-error-when-integrating-kafka-with-apache-spark-structured-streaming",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen integrating Kafka with Apache Spark Structured Streaming in Databricks, you encounter a timeout error, such as the following example.\nERROR KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets:  \r\njava.util.concurrent.ExecutionException: kafkashaded.org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\nCause\nYou’re using Spark's Kafka connector (\nspark-sql-kafka-0-10\n) with Kafka brokers older than version 0.10.0.\nThe\nspark-sql-kafka-0-10\nconnector used in Structured Streaming requires Kafka brokers 0.10.0 or above. It initiates an\nApiVersionRequest\nhandshake to determine broker capabilities.\nFurther, broker versions below 0.10.0 (such as 0.8.2) do not support\nApiVersionRequest\n. The broker either ignores the request or closes the connection, causing client timeouts during metadata operations like\ndescribeTopics\n.\nSolution\nDatabricks recommends upgrading your Kafka brokers to 0.10.0 or above to enable\nApiVersionRequest\nsupport. This aligns with Spark's protocol requirements.\nIf you are unable to upgrade, you can use the legacy Spark Connector. Replace\nspark-sql-kafka-0-10\nwith the older\nspark-streaming-kafka-0-8\nconnector."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/timestamp-change-to-underlying-apache-parquetchange-data-files-while-using-change-data-capture-cdc.json b/scraped_kb_articles/timestamp-change-to-underlying-apache-parquetchange-data-files-while-using-change-data-capture-cdc.json
new file mode 100644
index 0000000000000000000000000000000000000000..63ab85c6feb9f21573ae271ae19755e41b899930
--- /dev/null
+++ b/scraped_kb_articles/timestamp-change-to-underlying-apache-parquetchange-data-files-while-using-change-data-capture-cdc.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/timestamp-change-to-underlying-apache-parquetchange-data-files-while-using-change-data-capture-cdc",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen using Change Data Capture (CDC) to consume incremental data from an external table, you see that the underlying Apache Parquet or change data files’ timestamp changes.\nThis problem typically arises after moving the files to a different workspace and changing the underlying S3 bucket for the table. Despite copying all S3 files, including change data, to a new location and recreating the external table, CDC queries based on timestamp fail, while queries based on version number succeed.\nExample of a failing query\nselect * from table_changes('db.schema.table','2024-05-30T18:09:55.000-04:00')\nExample of a successful query\nselect * from table_changes('db.schema.table',4129)\nBoth queries are intended to retrieve the same change data.\nCause\nDelta Lake uses the file modification time to determine the timestamp of a commit.\nWhen files are copied to a new S3 bucket, their timestamps change, and there is no option to preserve the original timestamps. As a result, CDC queries based on timestamp fail because they rely on the physical timestamp of the files, which no longer matches the original commit times. In contrast, version-based queries succeed because the\ndelta_log\nversions remain consistent, regardless of the file timestamps.\nDelta Lake's behavior is documented in\na GitHub issue\n.\nImportant\nCloning a Delta table creates a separate history, affecting time travel queries and change data feed. For more information, please refer to the\nClone a table on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) and\nUse Delta Lake change data feed on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nFor timestamp-based queries, ensure that the original file timestamps are preserved during the migration process. If this is not possible, rely on version-based queries to retrieve change data.\nMonitor the development of the in-commit timestamp feature, which is currently in preview. 
This feature aims to address the issue by using commit timestamps instead of file modification times. You can contact your account team to sign up for the Databricks private preview to access this feature earlier.\nReview the Delta 4.0 roadmap and plan for its adoption once it becomes generally available in Databricks Runtime 16.x, as it includes enhancements to address this issue."
+}
\ No newline at end of file
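The cause can be reproduced locally in miniature (plain files standing in for S3 objects; all paths here are invented): a plain copy gets a fresh modification time, which is exactly what breaks timestamp-based `table_changes` queries after a bucket migration:

```python
import os
import shutil
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "part-00000.parquet")          # made-up file names
dst = os.path.join(d, "part-00000-copied.parquet")

with open(src, "wb") as f:
    f.write(b"delta data")

# Pretend the original commit happened in 2020 by backdating the mtime.
os.utime(src, (1_600_000_000, 1_600_000_000))

# Like a bucket-to-bucket copy: file contents move, mtime is not preserved.
shutil.copy(src, dst)

mtime_src = os.path.getmtime(src)
mtime_dst = os.path.getmtime(dst)
print(mtime_src < mtime_dst)  # the copy looks "newer" than the commit
```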
diff --git a/scraped_kb_articles/to_json-results-in-cannot-use-null-as-map-key-error.json b/scraped_kb_articles/to_json-results-in-cannot-use-null-as-map-key-error.json
new file mode 100644
index 0000000000000000000000000000000000000000..206a1b065b2ddab3baf34dfd30ba4c3e03dc56a3
--- /dev/null
+++ b/scraped_kb_articles/to_json-results-in-cannot-use-null-as-map-key-error.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/sql/to_json-results-in-cannot-use-null-as-map-key-error",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou are using\nto_json()\nto convert data to JSON and you get a\nCannot use null as map key\nerror:\nRuntimeException: Cannot use null as map key.\nCause\nThe\nto_json()\nfunction does not support null values as input map keys.\nThis example code causes the\nCannot use null as map key\nerror when run, because of the null value used as a map key in the fourth line.\n%sql\r\n\r\nselect\r\n  to_json(\r\n    map(\r\n      1, 'Databricks',\r\n      2, 'Map',\r\n      3, 'Error',\r\n      null, 'Data'\r\n    )\r\n  ) as json;\nSolution\nYou should filter out any null values present in the input data before running\nto_json()\n, or use\nnvl()\nto replace all of the null values with non-null values.\nFilter null values\nConsider this example DataFrame:\n+---+----------+-------+\r\n| Id|     Value|address|\r\n+---+----------+-------+\r\n|  1|Databricks|   null|\r\n|  2|       Map|   null|\r\n|  3|     Error|    xyz|\r\n+---+----------+-------+\nThere are two null values in the example.\nAttempting to use\nto_json()\non this DataFrame will return an error.\nYou can filter the null data by showing only the rows that have non-null values.\nFor example, filtering with\ndf.filter(\"address is not null\").show()\nreturns:\n+---+-----+-------+\r\n| Id|Value|address|\r\n+---+-----+-------+\r\n|  3|Error|    xyz|\r\n+---+-----+-------+\nThis filtered DataFrame does not contain any null values, so it can now be used as an input with\nto_json()\n.\nReplace null values with replacements\nIf you cannot filter out the null values, you can use\nnvl()\nto replace the null values with non-null values.\nThe sample code originally had a null value as the map key for the fourth line. Since that results in an error,\nnvl()\nis used in this updated example to substitute 4 for the null value.\n%sql\r\n\r\nselect\r\n  to_json(\r\n    map(\r\n      1, 'Databricks',\r\n      2, 'Map',\r\n      3, 'Error',\r\n      nvl(null, 4), 'Data'\r\n    )\r\n  ) as JSON;"
+}
\ No newline at end of file
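The `nvl()` substitution has a direct analogue in plain Python, sketched here with `json.dumps` standing in for `to_json()` (the dictionary contents mirror the article's map):

```python
import json

# Plain-Python analogue of nvl(null, 4): fall back to a default when the
# key is None, then serialize.
def nvl(value, default):
    return default if value is None else value

data = {1: "Databricks", 2: "Map", 3: "Error", None: "Data"}
payload = json.dumps({nvl(k, 4): v for k, v in data.items()}, sort_keys=True)
print(payload)
```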
diff --git a/scraped_kb_articles/too-many-execution-contexts-are-open-right-now.json b/scraped_kb_articles/too-many-execution-contexts-are-open-right-now.json
new file mode 100644
index 0000000000000000000000000000000000000000..e7ae048568bea1c2faadac6958ecd990e8fe4416
--- /dev/null
+++ b/scraped_kb_articles/too-many-execution-contexts-are-open-right-now.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/notebooks/too-many-execution-contexts-are-open-right-now",
+    "title": "Unknown Article Title",
+    "content": "Problem\nYou come across the following error message when you try to attach a notebook to a cluster, or when a job fails.\nRun result unavailable: job failed with error message Too many execution contexts are open right now.(Limit set currently to 150)\nCause\nDatabricks creates an execution context when you attach a notebook to a cluster. The execution context contains the state for a REPL environment for each supported programming language: Python, R, Scala, and SQL.\nThe cluster has a maximum number of 150 execution contexts. 145 are user REPLs, while the remaining five are allocated as internal system REPLs, which are reserved contexts for backend operations. Once this threshold is reached, you can no longer attach a notebook to the cluster.\nInfo\nIt is not possible to view the current number of execution contexts in use.\nSolution\nMake sure you have not disabled auto-eviction for the cluster in your\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nIf the following line is present in your Spark config, auto-eviction is disabled. Remove this line to re-enable auto-eviction:\nspark.databricks.chauffeur.enableIdleContextTracking false\nBest practices\nUse a job cluster instead of an interactive cluster. A job cluster for each job is the best way to avoid running out of execution contexts. Job clusters should be used for isolation and reliability.\nReduce the number of separate notebooks used to reduce the number of execution contexts required.\nTemporary workaround\nAs a short-term solution, you can use a cluster-scoped init script to increase the execution context limit from 150 to 175.\nWarning\nIf you increase the execution context limit, the driver memory pressure is likely to increase. 
You should not use this as a long-term solution.\nCreate the init script\nRun this sample script in a notebook to create the init script on your cluster.\n%scala\r\n\r\nval initScriptContent = s\"\"\"\r\n|#!/bin/bash\r\n|cat > /databricks/common/conf/set_exec_context_limit.conf << EOL\r\n|{\r\n| rb  = 170\r\n|}\r\n|EOL\r\n\"\"\".stripMargin\r\ndbutils.fs.put(\"dbfs://set_exec_context_limit.sh\", initScriptContent, true)\nRemember the path to the init script. You will need it when configuring your cluster.\nConfigure the init script\nFollow the documentation to configure a cluster-scoped init script (\nAWS\n|\nAzure\n|\nGCP\n).\nSet the\nDestination\nas\nDBFS\nand specify the path to the init script. Use the same path that you used in the sample script.\nAfter configuring the init script, restart the cluster."
+}
\ No newline at end of file
diff --git a/scraped_kb_articles/total-size-of-serialized-results-of-tasks-is-larger-than-spark-driver-max-result-size-when-using-odbc-connection.json b/scraped_kb_articles/total-size-of-serialized-results-of-tasks-is-larger-than-spark-driver-max-result-size-when-using-odbc-connection.json
new file mode 100644
index 0000000000000000000000000000000000000000..1c0fb1c5f045e6156a5e459f796d207ea433f359
--- /dev/null
+++ b/scraped_kb_articles/total-size-of-serialized-results-of-tasks-is-larger-than-spark-driver-max-result-size-when-using-odbc-connection.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/data-sources/total-size-of-serialized-results-of-tasks-is-larger-than-spark-driver-max-result-size-when-using-odbc-connection",
+    "title": "Unknown Article Title",
+    "content": "Problem\nWhen using Tableau to connect to Databricks and extract data from large tables, the extract refresh process fails with an error message.\nTotal size of serialized results of tasks is bigger than spark.driver.maxResultSize.\nThis issue can occur with any kind of compute, but comes up most often when using serverless compute because the\nspark.driver.maxResultSize\nconfiguration cannot be changed in serverless.\nCause\nThe amount of data being extracted from the Databricks tables exceeds the default\nspark.driver.maxResultSize\nlimit. This can happen when ODBC connection parameters disable Cloud Fetch.\nSolution\nIn your connection parameters, locate and delete the\nEnableQueryResultDownload=0\nparameter to re-enable Cloud Fetch.\nPreventative measures\nOptimize your queries to retrieve only the necessary data from Databricks tables, reducing the amount of data transferred and processed.\nIf using classic compute, you can increase the value of\nspark.driver.maxResultSize\nto a larger limit.\nClick on your compute, and navigate to the\nAdvanced options\nsection.\nClick to expand. Under the\nSpark\ntab, in the\nSpark config\nfield, add the following config.\nspark.driver.maxResultSize \nThe value of\n\ndepends on your driver size and the current value. To check the current value, open the Spark UI, navigate to the\nEnvironment\ntab, and search for the\nspark.driver.maxResultSize\nconfiguration. When making this change, make sure to choose a value that is higher than the current value you see in the Spark UI."
+}
\ No newline at end of file
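When picking the new value for the last step above, it helps to compare sizes numerically rather than eyeballing strings. The helper below is hypothetical (the names are invented) and parses Spark-style size strings such as `4g` or `512m`:

```python
# Binary multipliers matching Spark's k/m/g/t size suffixes.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def parse_size(s: str) -> int:
    """Parse a Spark size string such as '4g' or '512m' into bytes."""
    s = s.strip().lower()
    if s and s[-1] in UNITS:
        return int(float(s[:-1]) * UNITS[s[-1]])
    return int(s)  # bare number of bytes

def is_valid_increase(current: str, proposed: str) -> bool:
    # The new spark.driver.maxResultSize must exceed the value currently
    # shown in the Spark UI's Environment tab.
    return parse_size(proposed) > parse_size(current)

print(is_valid_increase("4g", "8g"))  # a real increase
```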
diff --git a/scraped_kb_articles/track-deleted-files-from-vacuum-in-delta-table-history.json b/scraped_kb_articles/track-deleted-files-from-vacuum-in-delta-table-history.json
new file mode 100644
index 0000000000000000000000000000000000000000..076428c08732ef6957ad942052ba1c92bb60aee3
--- /dev/null
+++ b/scraped_kb_articles/track-deleted-files-from-vacuum-in-delta-table-history.json
@@ -0,0 +1,5 @@
+{
+    "url": "https://kb.databricks.com/en_US/delta/track-deleted-files-from-vacuum-in-delta-table-history",
+    "title": "Unknown Article Title",
+    "content": "Introduction\nAfter executing a VACUUM operation on a Delta table, you want to verify how many files were deleted and the total size of the removed data for auditing or optimization validation.\nInstructions\nTo obtain the number of deleted files and the size of deleted data, query the\noperationMetrics\nfield corresponding to the\nVACUUM START\nand\nVACUUM END\nentries in the table’s history.\nSample query\n```sql\r\nSELECT operation, operationParameters, operationMetrics\r\nFROM (DESC HISTORY ``.``.`
`)\r\nWHERE operation IN ('VACUUM START', 'VACUUM END');\r\n```\nThis query returns the start and end events for the\nVACUUM\noperation, along with key metrics such as:\nnumFilesToDelete\n: Number of files to be deleted.\nsizeOfDataToDelete\n: Total size of data to be deleted.\nnumDeletedFiles\n: Number of files deleted.\nBest practices\nAlways check both\nVACUUM START\nand\nVACUUM END\nentries to ensure the operation is completed successfully.\nAutomate logging of\noperationMetrics\nafter\nVACUUM\nto maintain a cleanup audit trail. For instructions, review the\nAutomate VACUUM metrics logging for Delta table cleanup audits\nKB article.\nCombine\nDESCRIBE HISTORY\nwith partition or timestamp filters to narrow down recent maintenance activity." +} \ No newline at end of file diff --git a/scraped_kb_articles/trailing-zeros-in-decimal-values-appear-when-reading-parquet-files-in-apache-spark.json b/scraped_kb_articles/trailing-zeros-in-decimal-values-appear-when-reading-parquet-files-in-apache-spark.json new file mode 100644 index 0000000000000000000000000000000000000000..d145496be711d062aaa962c2d9e22a3c38a10f70 --- /dev/null +++ b/scraped_kb_articles/trailing-zeros-in-decimal-values-appear-when-reading-parquet-files-in-apache-spark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/trailing-zeros-in-decimal-values-appear-when-reading-parquet-files-in-apache-spark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen reading Parquet files using\nspark.read.parquet\n, you notice trailing zeros added to decimal values.\nExample\nA source has data (1,50.0); (2,6.2300); (3,4.56). After reading the data in Parquet format, the values appear with trailing zeros.\nid\nvalue\n1\n50.0000\n2\n6.2300\n3\n4.5600\nCause\nApache Spark infers the schema for Parquet tables based on the column values and assigns a consistent scale to all decimal values. 
Trailing zeros appear to the right of the decimal point after all non-zero digits to ensure scale uniformity across the dataset. In the example in the problem statement, trailing zeros were added to achieve a consistent four places after the decimal.\nThis behavior is by design to maintain consistency and precision when processing decimal data.\nSolution\nTo address the appearance of trailing zeros without altering the underlying data type or precision, use the\nformat_number\nfunction. This function allows you to specify the desired number of decimal places to display.\nThe line\nformat_number(col(\"\"), 2)\nin the following example formats the values in the\ndecimal_column\nto two decimal places. Adjust the second parameter (\n2\n) to control the number of decimal places displayed.\nExample\nfrom pyspark.sql.functions import format_number, col\r\ndf = spark.read.parquet(\"\")\r\ndf = df.withColumn(\"\", format_number(col(\"\"), 2))\nFor more information, review the\nformat_number function\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/trigger-a-job-as-a-specific-user-with-run-as.json b/scraped_kb_articles/trigger-a-job-as-a-specific-user-with-run-as.json new file mode 100644 index 0000000000000000000000000000000000000000..e1c6fc38820bbaa3a9956e7a333891fa1c515401 --- /dev/null +++ b/scraped_kb_articles/trigger-a-job-as-a-specific-user-with-run-as.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/trigger-a-job-as-a-specific-user-with-run-as", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou may encounter difficulties when trying to trigger a job with a specific user ID using the\nRun as\noption. This issue arises when multiple users need to run jobs with their own input values and then later filter the job runs based on the user who initiated them.\nCause\nThe Jobs API does not support changing the\nRun as\nparameter for each job run. 
This prevents users from specifying their own user ID when triggering a job, making it challenging to filter job runs by the initiating user.\nBy default, the job creator is tagged as the\nRun as\nuser. We can edit the\nRun as\na user to\nrun a job as a service principal\n(\nAWS\n|\nAzure\n|\nGCP\n), but it is a one-time activity. Each run cannot normally have an individual identity.\nSolution\nThe identity assigned should have the following permissions:\nCluster creation (if using the job cluster)\nPermission to execute and run the underlying resources (interactive cluster, notebook, Python file, other job resources)\nThere are two ways to trigger a job with a specific\nRun as\nuser.\nUse a Run Job task in the UI\nClick\nNew\n.\nClick\nJob\n.\nIn the\nTasks\ntab, set\nRun Job\n(\nAWS\n|\nAzure\n|\nGCP\n) as the\nType\n.\nIn the\nJob\ndrop-down menu, select the job you want to execute. This task type uses the same job configuration as the original job but triggers it with the user ID specified in the\nJob details\nfield.\nInfo\nOnly workspace admins can assign a\nRun as\nuser different from themselves.\nUse the API\nUse the\ncreate command\n(\nAWS\n|\nAzure\n|\nGCP\n) in the Jobs API payload to trigger the job with your desired\nRun as\nuser.\nUse the sample code and update the following values:\n\n: The desired job name.\n\n: The desired task name.\n\n: The\njob_id\nto trigger with the new job.\n\n: User name to be assigned.\nThe\nRun as\njob uses the same configuration as the parent job.\nExample code\n{\r\n   \"name\": \"\",\r\n   \"email_notifications\": {\r\n       \"no_alert_for_skipped_runs\": false\r\n   },\r\n   \"webhook_notifications\": {},\r\n   \"timeout_seconds\": 0,\r\n   \"max_concurrent_runs\": 1,\r\n   \"tasks\": [\r\n       {\r\n           \"task_key\": \"\",\r\n           \"run_if\": \"ALL_SUCCESS\",\r\n           \"run_job_task\": {\r\n               \"job_id\": \"\"\r\n           },\r\n           \"timeout_seconds\": 0,\r\n           
\"email_notifications\": {}\r\n       }\r\n   ],\r\n   \"queue\": {\r\n       \"enabled\": true\r\n   },\r\n   \"run_as\": {\r\n       \"user_name\": \"\"\r\n   }\r\n}\nThe payload triggers the root job with\njob_id\nas\n\n, using the specified user name. This allows you to filter job runs in the\nRun as\nsection of the\nJob Run\nUI." +} \ No newline at end of file diff --git a/scraped_kb_articles/troubleshoot-cancel-command.json b/scraped_kb_articles/troubleshoot-cancel-command.json new file mode 100644 index 0000000000000000000000000000000000000000..9c9d9a4138eb7f4491262ab054ff9af402ef0630 --- /dev/null +++ b/scraped_kb_articles/troubleshoot-cancel-command.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/troubleshoot-cancel-command", + "title": "Título do Artigo Desconhecido", + "content": "This article provides an overview of troubleshooting steps you can take if a notebook is unresponsive or cancels commands.\nCheck metastore connectivity\nProblem\nSimple commands in newly-attached notebooks fail, but succeed in notebooks that were attached to the same cluster earlier.\nTroubleshooting steps\nCheck metastore connectivity. The inability to connect to the Hive metastore can cause REPL initialization to hang, making the cluster appear unresponsive.\nAre you are using the Databricks metastore or your own external metastore? If you are using an external metastore, have you changed anything recently? Did you upgrade your metastore version? Rotate passwords or configurations? Change security group rules?\nSee\nMetastore\nfor more troubleshooting tips and solutions.\nCheck for conflicting libraries\nProblem\nPython library conflicts can result in cancelled commands. 
The Databricks support organization sees conflicts most often with versions of\nipython\n,\nnumpy\n,\nscipy\n, and\npandas\n.\nTroubleshooting steps\nReview the\nCluster cancels Python command execution due to library conflict\nKB article for more information.\nFor more notebook troubleshooting information, see\nNotebooks\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/troubleshoot-key-vault-access.json b/scraped_kb_articles/troubleshoot-key-vault-access.json new file mode 100644 index 0000000000000000000000000000000000000000..1e14671581b84a5b9087cbe6717c87d8e92c16ee --- /dev/null +++ b/scraped_kb_articles/troubleshoot-key-vault-access.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/troubleshoot-key-vault-access", + "title": "Título do Artigo Desconhecido", + "content": "You are trying to access secrets, when you get an error message.\ncom.databricks.common.client.DatabricksServiceException: INVALID_STATE: Databricks could not access keyvault: https://xxxxxxx.vault.azure.net/.\nThere is not a single root cause for this error message, so you will have to do some troubleshooting.\nConfirm permissions are correctly set on the key vault\nLoad the Azure Portal.\nOpen\nKey vaults\n.\nClick the key vault.\nClick\nAccess policies\n.\nVerify the\nGet\nand\nList\npermissions are applied.\nInspect the firewall configuration on the key vault\nLoad the Azure Portal.\nOpen\nKey vaults\n.\nClick the key vault.\nClick\nNetworking\n.\nClick\nFirewalls and virtual networks\n.\nSelect\nPrivate endpoint and selected networks\n.\nVerify that\nAllow trusted Microsoft services to bypass this firewall?\nis set to\nYes\n.\nAttempt to access the secrets.\nIf you can view the secrets, the issue is resolved.\nIf you are still getting the\nINVALID_STATE: Databricks could not access keyvault\nerror, continue troubleshooting.\nList all secrets in the secret scope\nOpen a notebook.\nList all secrets in 
scope.\n%python\r\n\r\ndbutils.secrets.list(\"\")\nTry to access individual secrets\nTry to access a few different, random secrets.\n%python\r\n\r\ndbutils.secrets.get(\"\", \"\")\nIf some secrets can be fetched while others fail, the failed secrets are either disabled or inactive.\nEnable individual secrets\nLoad the Azure Portal.\nOpen\nKey vaults\n.\nClick the key vault.\nClick\nSecrets\n.\nClick the secret and verify that the status is set to\nEnabled\n.\nIf the secret is disabled, enable it, or create a new version.\nVerify that individual secrets are working\nTry to access the previously failed secrets.\n%python\r\n\r\ndbutils.secrets.get(\"\", \"\")\nYou should now be able to fetch all of them." +} \ No newline at end of file diff --git a/scraped_kb_articles/trying-to-decode-a-protocol-buffer-and-getting-error-protobuf_dependency_not_found.json b/scraped_kb_articles/trying-to-decode-a-protocol-buffer-and-getting-error-protobuf_dependency_not_found.json new file mode 100644 index 0000000000000000000000000000000000000000..a73cb457d6c7e061f88d90127ce4cdeb5234d207 --- /dev/null +++ b/scraped_kb_articles/trying-to-decode-a-protocol-buffer-and-getting-error-protobuf_dependency_not_found.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/trying-to-decode-a-protocol-buffer-and-getting-error-protobuf_dependency_not_found", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to decode a protocol buffer (protobuf) message containing timestamp data using the\nfrom_protobuf()\nApache Spark SQL built-in function, you encounter an error message.\n[PROTOBUF_DEPENDENCY_NOT_FOUND] Could not find dependency: google/protobuf/timestamp.proto\nCause\nThe\nfrom_protobuf()\nfunction cannot find the required protobuf dependency\ngoogle/protobuf/timestamp.proto\nbecause the protobuf descriptor file does not include this dependency when it is created.\nSolution\nUse the option\n--include_imports\nwhile creating the protobuf descriptor file, and then 
use this descriptor file in the\nfrom_protobuf()\nfunction.\nExample\nprotoc --descriptor_set_out=sample.desc --include_imports sample.proto\ndf.select(from_protobuf(\"value\", \"AppEvent\", \"sample.desc\").alias(\"event\"))\nNote\nYou only need an explicit import for\nTimestampType\nand\nDayTimeIntervalType\n.\nTimestamp\nis represented as\n{seconds: Long, nanos: Int}\nand maps to the TimestampType in Spark SQL.\nDuration\nmaps to\nDayTimeIntervalType\nin Spark SQL.\nFor more information on data type mapping, please refer to the\nSpark Protobuf Data Source Guide\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/trying-to-load-an-mlflow-model-using-a-python-script-returns-a-py4jjavaerror-error.json b/scraped_kb_articles/trying-to-load-an-mlflow-model-using-a-python-script-returns-a-py4jjavaerror-error.json new file mode 100644 index 0000000000000000000000000000000000000000..ec576f297b99f526586a36edca46d0bda420d9ef --- /dev/null +++ b/scraped_kb_articles/trying-to-load-an-mlflow-model-using-a-python-script-returns-a-py4jjavaerror-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/trying-to-load-an-mlflow-model-using-a-python-script-returns-a-py4jjavaerror-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile running a job using a Python script to load an MLflow model, you encounter the following error message.\nPy4JJavaError: An error occurred while calling o422.get.: java.util.NoSuchElementException: None.get\nCause\nMLflow has an issue in versions below 2.15.0. 
It is resolved in 2.15.0 and above.\nSolution\nUpgrade MLflow to version 2.15.0 or above.\nVerify the current MLflow version.\nimport mlflow\r\nprint(mlflow.__version__)\nIf it’s an older version, or you use a Databricks Runtime version that comes with an older version of MLflow, install the latest version of MLflow in your Databricks cluster.\nGo to your workspace and select the cluster where you want to install the latest version of MLflow.\nClick the\nLibraries\ntab and then click\nInstall New\n.\nIn the\nInstall Library\ndialog box, select\nPyPI\nas the library source and enter\nMLflow\nin the\nPackage\nfield. Also make sure to check\nAttach to cluster\n.\nClick\nInstall\nto install the latest version of MLflow." +} \ No newline at end of file diff --git a/scraped_kb_articles/trying-to-perform-write-over-union-all-causes-error.json b/scraped_kb_articles/trying-to-perform-write-over-union-all-causes-error.json new file mode 100644 index 0000000000000000000000000000000000000000..01d33678859a78dccb0e821378e54c8754a1fe94 --- /dev/null +++ b/scraped_kb_articles/trying-to-perform-write-over-union-all-causes-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/trying-to-perform-write-over-union-all-causes-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nIn Databricks Runtime 14.3 LTS and below, when you perform\nWRITE\nover\nUNION ALL\n, you receive an error stating the query is not supported by Photon. Upon checking the SQL query plan page in the Apache Spark UI, you see additional information.\n== Photon Explanation ==\r\nPhoton does not fully support the query because:\r\nUnsupported node: Union\nCause\nThe Photon implementation in Databricks Runtime 14.3 LTS and below only runs\nUNION\nin Photon when there is a\nSHUFFLE\nabove the\nUNION\n. If there is\nWRITE\ninstead, the\nUNION\ndoes not run in Photon.\nSolution\nTo enable support for\nWRITE\nover\nUNION ALL\n, use Databricks Runtime 15.4 LTS or above. 
As of Databricks Runtime 16.1, the ability to perform\nWRITE\nover\nUNION ALL\nis available by default." +} \ No newline at end of file diff --git a/scraped_kb_articles/trying-to-write-excel-files-to-a-unity-catalog-volume-fails.json b/scraped_kb_articles/trying-to-write-excel-files-to-a-unity-catalog-volume-fails.json new file mode 100644 index 0000000000000000000000000000000000000000..3b5b638fc2cf130a6cdef83c489c4eb82f2cacff --- /dev/null +++ b/scraped_kb_articles/trying-to-write-excel-files-to-a-unity-catalog-volume-fails.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/trying-to-write-excel-files-to-a-unity-catalog-volume-fails", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to write a DataFrame as an Excel file to a Unity Catalog volume when it fails with an\nOperation not permitted\nerror message.\nCause\nThis is a known limitation when writing to Unity Catalog volumes.\nAs stated in the Databricks\nVolumes limitations\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation, “Direct-append or non-sequential (random) writes, such as writing Zip and Excel files are not supported.”\nSolution\nYou should perform the write operation on a local disk and copy the results to a Unity Catalog volume. You can use\n/local_disk0/tmp\nfor all cluster types. Other paths may fail.\nExample code\nThese examples show you two possible ways to accomplish this task. The first uses\nopenpyxl\n, while the second does not require any additional package installation.\nUse\nopenpyxl\nwith\n/local_disk0/tmp\nWe create a sample DataFrame to be written to a Unity Catalog volume path as an Excel file. The file is written to a temporary local path before it is copied to the volume path. The\nopenpyxl\npackage is used to create the Excel file in the local path. 
Then we use the\nshutil\ncopyfile\nfunction to copy the file from the local path to the volume.\n%python\r\n%pip install openpyxl\r\ndbutils.library.restartPython()\r\n\r\nimport pandas as pd\r\nfrom openpyxl import Workbook\r\nfrom shutil import copyfile\r\n\r\ndef write_excel(data, filename):\r\n    \"\"\"\r\n    Write data to an Excel (.xlsx) file.\r\n    \r\n    Parameters:\r\n        data (list of lists): The data to write, with the first inner list as the header row.\r\n        filename (str): The filename for the Excel file in local_disk.\r\n    \"\"\"\r\n    wb = Workbook()           # Create a new workbook\r\n    ws = wb.active            # Select the active worksheet\r\n    \r\n    # Iterate over each row in the data and append it to the worksheet\r\n    for row in data:\r\n        ws.append(row)\r\n    \r\n    wb.save(filename)         # Save the workbook to the specified file\r\n\r\n# Example usage:\r\ndata = [\r\n    [\"Name\", \"Age\", \"City\"],  # Header row\r\n    [\"Alice\", 30, \"New York\"],\r\n    [\"Bob\", 25, \"San Francisco\"],\r\n    [\"Charlie\", 35, \"Los Angeles\"]\r\n]\r\n\r\nlocal_file = \"/local_disk0/tmp/excel.xlsx\"\r\nfilename = \r\nprint(filename)\r\nwrite_excel(data, local_file)\r\ncopyfile(local_file, filename)\nNo package installation\nWe create a sample dataframe to be written to a Unity Catalog volumes path as an Excel file. The file is written to a temporary local path before it is copied to the volume path. We use the\npandas\nfunction\n.to_excel()\nto create the Excel file in the local path. 
Then we use the\nshutil\ncopyfile\nfunction to copy the file from the local path to the volume.\n%python\r\n\r\nimport pandas as pd\r\nfrom shutil import copyfile\r\n\r\ndef write_excel(data, filename):\r\n    \"\"\"\r\n    Write data to an Excel (.xlsx) file.\r\n    \r\n    Parameters:\r\n        data (list of lists): The data to write, with the first inner list as the header row.\r\n        filename (str): The filename for the Excel file.\r\n    \"\"\"\r\n    df = pd.DataFrame(data[1:], columns=data[0])\r\n    df.to_excel('/local_disk0/tmp/excel.xlsx', index=False, sheet_name=\"Sheet1\")\r\n    copyfile('/local_disk0/tmp/excel.xlsx', filename)\r\n\r\n# Example usage:\r\ndata = [\r\n    [\"Name\", \"Age\", \"City\"],  # Header row\r\n    [\"Alice\", 30, \"New York\"],\r\n    [\"Bob\", 25, \"San Francisco\"],\r\n    [\"Charlie\", 35, \"Los Angeles\"]\r\n]\r\n\r\nfilename = \r\nwrite_excel(data, filename)" +} \ No newline at end of file diff --git a/scraped_kb_articles/typeerror-with-an-unexpected-keyword-argument-query_type-when-attempting-to-perform-hybrid-similarity-search-using-the-databricks-vectorsearch-package.json b/scraped_kb_articles/typeerror-with-an-unexpected-keyword-argument-query_type-when-attempting-to-perform-hybrid-similarity-search-using-the-databricks-vectorsearch-package.json new file mode 100644 index 0000000000000000000000000000000000000000..fea897ff55979931023af2447b000836176870b0 --- /dev/null +++ b/scraped_kb_articles/typeerror-with-an-unexpected-keyword-argument-query_type-when-attempting-to-perform-hybrid-similarity-search-using-the-databricks-vectorsearch-package.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/typeerror-with-an-unexpected-keyword-argument-query_type-when-attempting-to-perform-hybrid-similarity-search-using-the-databricks-vectorsearch-package", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using the\ndatabricks-vectorsearch\npackage to perform a hybrid 
keyword-similarity search, you set the\nquery_type\nargument  to\nhybrid\nin the\nsimilarity_search\nfunction and receive an error.\nTypeError: VectorSearchIndex.similarity_search() got an unexpected keyword argument 'query_type'\nCause\nThe\nquery_type\nargument was introduced in version 0.38 of the\ndatabricks-vectorsearch\npackage. Older versions do not support this type of search.\nSolution\nTo resolve the issue, update your\ndatabricks-vectorsearch\npackage.\n!pip install databricks-vectorsearch --force-reinstall\r\ndbutils.library.restartPython()" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-azure-databricks-account-as-an-admin.json b/scraped_kb_articles/unable-to-access-azure-databricks-account-as-an-admin.json new file mode 100644 index 0000000000000000000000000000000000000000..054d657cfc927638f9dfc043e51b0a65d542b0b6 --- /dev/null +++ b/scraped_kb_articles/unable-to-access-azure-databricks-account-as-an-admin.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/unable-to-access-azure-databricks-account-as-an-admin", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-delta-sharing-tables-with-a-python-client.json b/scraped_kb_articles/unable-to-access-delta-sharing-tables-with-a-python-client.json new file mode 100644 index 0000000000000000000000000000000000000000..e6916b9bfc488117981d0a65be3dd297713dfcf8 --- /dev/null +++ b/scraped_kb_articles/unable-to-access-delta-sharing-tables-with-a-python-client.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unable-to-access-delta-sharing-tables-with-a-python-client", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDelta Sharing\nis a platform independent\nopen protocol\nthat is used to securely share data with other organizations.\nWhen using an open sharing model, recipients can access shared data in a read-only 
format using the\ndelta-sharing\nPython library.\nWhen trying to access a shared table using any Python client, you get an\nSSLCertVerificationError.\nSSLCertVerificationError                  Traceback (most recent call last)\r\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)\r\n    669             # Make the request on the httplib connection object.\r\n--> 670             httplib_response = self._make_request(\r\n    671                 conn,\r\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)\r\n    380         try:\r\n--> 381             self._validate_conn(conn)\r\n    382         except (SocketTimeout, BaseSSLError) as e:\r\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connectionpool.py in _validate_conn(self, conn)\r\n    975         if not getattr(conn, \"sock\", None):  # AppEngine might not have  `.sock`\r\n--> 976             conn.connect()\r\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\connection.py in connect(self)\r\n    360\r\n--> 361         self.sock = ssl_wrap_socket(\r\n    362             sock=conn,\r\nC:\\ProgramData\\Anaconda3\\lib\\site-packages\\urllib3\\util\\ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)\r\n    376         if HAS_SNI and server_hostname is not None:\r\n--> 377             return context.wrap_socket(sock, server_hostname=server_hostname)\r\nC:\\ProgramData\\Anaconda3\\lib\\ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)\r\n    499         # ctx._wrap_socket()\r\n--> 500         return self.sslsocket_class._create(\r\n    501      
       sock=sock,\r\nC:\\ProgramData\\Anaconda3\\lib\\ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)\r\n   1039                         raise ValueError(\"do_handshake_on_connect should not be specified for non-blocking sockets\")\r\n-> 1040                     self.do_handshake()\r\n   1041             except (OSError, ValueError):\nCause\nThe client IP address is not whitelisted in the storage account firewall.\nSolution\nIf the firewall is enabled on the storage account, ensure that the client IP address is whitelisted in the firewall.\nSign in to the\nAzure Portal\n.\nExpand the left sidebar menu and click\nStorage accounts\nto display a list of your storage accounts. If the portal menu isn't visible, click the menu button to toggle it on.\nClick the name of the storage account you want to edit.\nClick\nNetworking\nunder the\nSecurity + networking\nheader.\nMake sure\nFirewalls and virtual networks\nis selected.\nSelect\nEnabled from selected virtual networks and IP addresses\n.\nUnder\nFirewall\nensure that a check mark appears next to\nAdd your client IP address\n.\nEnter your client IP address in the\nAddress range\nfield.\nSelect\nSave\nto apply your changes.\nWait two minutes to ensure the changes have propagated.\nYou should now be able to access the Delta Sharing tables with your local Python client." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-secrets-using-instance-profile-in-shared-access-mode.json b/scraped_kb_articles/unable-to-access-secrets-using-instance-profile-in-shared-access-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..028220b1c46dd1519d9c4bf75e30de82e2462baf --- /dev/null +++ b/scraped_kb_articles/unable-to-access-secrets-using-instance-profile-in-shared-access-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unable-to-access-secrets-using-instance-profile-in-shared-access-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to access secrets using an instance profile with a shared access mode cluster, you encounter the following error message.\nNoCredentialsError: Unable to locate credentials\nCause\nAccessing secrets using an instance profile is not supported in shared access mode, in order to prevent unauthorized access to sensitive information.\nSolution\nUse a single-user access mode cluster instead.\nFor more information, review the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n)" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-the-hive_metastore-schema.json b/scraped_kb_articles/unable-to-access-the-hive_metastore-schema.json new file mode 100644 index 0000000000000000000000000000000000000000..da9507bf758e77a27ca9e064acf46ff1b15cec2b --- /dev/null +++ b/scraped_kb_articles/unable-to-access-the-hive_metastore-schema.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unable-to-access-the-hive_metastore-schema", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re trying to access the\nhive_metastore\nschema and receive an error message.\nsummary: Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table . 
Exception thrown when executing query: SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS NUCLEUS_TYPE, A0.CREATE_TIME, A0.LAST_ACCESS_TIME, A0.OWNER, A0.RETENTION, A0.REWRITE_ENABLED, A0.TBL_NAME, A0.TBL_TYPE, A0.TBL_ID FROM TBLS A0 LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID WHERE A0.TBL_NAME = ? AND B0.`NAME` = ?, data: com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table . Exception thrown when executing\nAdditionally, you may see a\njava.sql.SQLException\nindicating an unknown column\n'A0.REWRITE_ENABLED'\nin the field list.\nCause\nDifferent clusters are configured with different versions of the Hive metastore. For example, one cluster might use\nspark.sql.hive.metastore.version: 2.3.7\nwhile another uses\nspark.sql.hive.metastore.version: 2.3.9\n.\nSpecific Apache Spark configurations may also be missing, such as\nspark.sql.hive.metastore.jars\n.\nSolution\nEnsure that all clusters use the same Hive metastore version. For example, set\nspark.sql.hive.metastore.version\nto\n2.3.7\non all clusters.\nAdd the following configuration to the Spark settings of the affected clusters.\nHive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above): \r\n\r\n     set spark.sql.hive.metastore.jars to builtin\nFor all other Hive versions, Azure Databricks recommends that you download the metastore JARs and set the configuration\nspark.sql.hive.metastore.jars\nto point to the downloaded JARs. For more information, review the\nExternal Apache Hive metastore (legacy)\ndocumentation.\nAdditionally, ensure that you or a given user have the necessary permissions to access the tables in the Hive metastore. This includes checking permissions on the external Hive metastore side." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-unity-catalog-tables-using-power-bi-desktop-or-tableau.json b/scraped_kb_articles/unable-to-access-unity-catalog-tables-using-power-bi-desktop-or-tableau.json new file mode 100644 index 0000000000000000000000000000000000000000..5f7fe98377dc4cd3dbfbd09c3cf102a33cb82214 --- /dev/null +++ b/scraped_kb_articles/unable-to-access-unity-catalog-tables-using-power-bi-desktop-or-tableau.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unable-to-access-unity-catalog-tables-using-power-bi-desktop-or-tableau", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile connecting Databricks with an external ecosystem tool such as Power BI Desktop or Tableau, you notice you can only access Hive metastore (HMS) tables. Unity Catalog tables are not accessible.\nCause\nYou are using No Isolation Shared access mode, which does not support Unity Catalog.\nAlternatively, or additionally, you are using a deprecated Databricks Runtime version, which can cause compatibility issues.\nSolution\nSwitch to a supported access mode that allows Unity Catalog access. Refer to the “Access modes” section of the\nCompute configuration reference\n(\nAWS\n|\nAzure\n) documentation for more information.\nUpgrade to Databricks Runtime 13.3 LTS or above to connect to UC tables.\nAdditionally, create an ODBC DSN for the Databricks ODBC driver. Ensure that the ODBC driver is installed before proceeding with the setup. Follow the instructions in the\nCreate an ODBC DSN for the Databricks ODBC Driver\n(\nAWS\n|\nAzure\n) documentation.\nAfter completing the above steps, test the connection in the external tool to confirm you can see your Unity Catalog tables." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-access-unity-catalog-views.json b/scraped_kb_articles/unable-to-access-unity-catalog-views.json new file mode 100644 index 0000000000000000000000000000000000000000..01c78d10eddcf917808e90f5c450da6139393dee --- /dev/null +++ b/scraped_kb_articles/unable-to-access-unity-catalog-views.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unable-to-access-unity-catalog-views", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nA user is trying to access a view in Unity Catalog when it fails with a\nTable '' does not have sufficient privilege to execute\nerror message.\nError in SQL statement: AnalysisException: Table '' does not have sufficient privilege to execute.\nCause\nThe owner of the view does not have sufficient privileges on the source table.\nSolution\nYou must ensure the owner of the view has enough privileges on the underlying table from which the view was created.\nVerify the owner of the view with\nDESCRIBE TABLE EXTENDED\n. 
Run\nDESCRIBE TABLE EXTENDED\non the view and look at the results in the\nOwner\ncolumn.\n%sql\r\n\r\nDESCRIBE TABLE EXTENDED ..;\nVerify the view owner has\nUSE CATALOG and USE SCHEMA\npermissions on the catalog and the\nSELECT\nprivilege on the table.\n%sql\r\n\r\nSHOW GRANTS `` on ..;\nYou can also verify the permissions using Data Explorer.\nClick\nData\nto open Data Explorer.\nSelect the catalog.\nClick\nPermissions\n.\nVerify the permissions for the view owner.\nIf the view owner does not have the correct permissions, grant\nUSE CATALOG and USE SCHEMA\non the catalog and\nSELECT\non the table.\n%sql\r\n\r\nGRANT USE CATALOG,USE SCHEMA, SELECT ON CATALOG TO ``;\nYou can also grant permissions using Data Explorer.\nClick\nData\nto open Data Explorer.\nSelect the catalog.\nClick\nPermissions\n.\nClick\nGrant\n.\nSelect the view owner from the\nUsers and groups\ndrop down list.\nSelect\nData Reader\nfrom the\nPrivilege presets\ndrop down list.\nClick\nGrant\n.\nThe view owner should now be able to access the view." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-attach-init-scripts-to-an-interactive-cluster.json b/scraped_kb_articles/unable-to-attach-init-scripts-to-an-interactive-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..21cbf77aba198e11b8a9fd64ff7dd5f91bbd5ee2 --- /dev/null +++ b/scraped_kb_articles/unable-to-attach-init-scripts-to-an-interactive-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unable-to-attach-init-scripts-to-an-interactive-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to attach an init script to an interactive cluster, you encounter the following error.\nInit script failure: PERMISSION_DENIED: User or group ID: does not exist in the workspace\nCause\nThe cluster creator’s\n\nand\n\nare no longer part of any Databricks group, or have been removed from the workspace.\nSolution\nVerify whether the cluster creator’s\n\nand\n\nexist in the workspace and group. Review the\nList users\nAPI documentation for details on retrieving user and group details.\nIf the user is missing:\nRe-add them to the appropriate groups and workspace.\nEnsure the user has the necessary permissions to manage and attach init scripts to the cluster.\nIf the user is no longer available, reassign cluster ownership to an active user with the required permissions. For more information, review the\nChange cluster owner\nAPI documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-cast-string-to-varchar.json b/scraped_kb_articles/unable-to-cast-string-to-varchar.json new file mode 100644 index 0000000000000000000000000000000000000000..1550fb5f22d39e0c4586025c5b373bc56c5b4beb --- /dev/null +++ b/scraped_kb_articles/unable-to-cast-string-to-varchar.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/unable-to-cast-string-to-varchar", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to cast a\nstring\ntype column to\nvarchar\nbut it isn’t working.\nInfo\nThe varchar data type (\nAWS\n|\nAzure\n|\nGCP\n) is available in Databricks Runtime 8.0 and above.\nCreate a simple Delta table, with one column as type\nstring\n.\n%sql\r\n\r\nCREATE OR REPLACE TABLE delta_table1 (`col1` string)\r\nUSING DELTA;\nUse\nSHOW TABLE\non the newly created table and it reports a\nstring\ntype.\n%sql\r\n\r\nSHOW CREATE TABLE delta_table1;\nCreate a second Delta table, based on the first, and convert the\nstring\ntype column into\nvarchar\n.\n%sql\r\n\r\nCREATE OR REPLACE TABLE delta_varchar_table1\r\nUSING DELTA\r\nAS\r\nSELECT cast(col1 AS VARCHAR(1000)) FROM delta_table1;\nUse\nSHOW TABLE\non the newly created table and it reports that the table got created, but the column is\nstring\ntype.\n%sql\r\n\r\nSHOW CREATE TABLE delta_varchar_table1;\nCause\nThe\nvarchar\ntype can only be used in table schema. 
It cannot be used in functions or operators.\nPlease review the Spark\nsupported data types\ndocumentation for more information.\nSolution\nYou cannot cast\nstring\nto\nvarchar\n, but you can create a\nvarchar\nDelta table.\n%sql\r\n\r\nCREATE OR REPLACE TABLE delta_varchar_table2 (`col1` VARCHAR(1000))\r\nUSING DELTA;\nUse\nSHOW TABLE\non the newly created table and it reports a\nvarchar\ntype.\n%sql\r\n\r\nSHOW CREATE TABLE delta_varchar_table2;\nYou can now create another\nvarchar\nDelta table, based on the first, and it keeps the\nvarchar\ntype.\n%sql\r\n\r\nCREATE OR REPLACE TABLE delta_varchar_table3\r\nUSING DELTA\r\nAS\r\nSELECT * FROM delta_varchar_table2;\nUse\nSHOW TABLE\non the newly created table and it reports a\nvarchar\ntype.\n%sql\r\n\r\nSHOW CREATE TABLE delta_varchar_table3;" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-connect-endpoints-via-non-httphttps-port-number-when-using-standard-formerly-shared-access-mode.json b/scraped_kb_articles/unable-to-connect-endpoints-via-non-httphttps-port-number-when-using-standard-formerly-shared-access-mode.json new file mode 100644 index 0000000000000000000000000000000000000000..a49f4caa39a35ace7b8788a2c0848d30d32008d6 --- /dev/null +++ b/scraped_kb_articles/unable-to-connect-endpoints-via-non-httphttps-port-number-when-using-standard-formerly-shared-access-mode.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unable-to-connect-endpoints-via-non-httphttps-port-number-when-using-standard-formerly-shared-access-mode", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re trying to connect to a custom endpoint within the same VPC/VNet as your workspace. The endpoint is hosted on non-standard ports (ports other than HTTP-80 or HTTPS-443). 
You receive a\nConnection Refused\nerror during connection attempts.\nYou notice the issue when working on a standard (formerly shared) access mode cluster on Databricks Runtime 11.3 LTS or below.\nCause\nDatabricks Runtimes 11.3 LTS and below restrict outbound access to certain ports by default on standard access mode clusters, even within the same VPC/VNet.\nSolution\nDatabricks recommends using Databricks Runtime 12.2 LTS or above with standard access mode.\nIf you prefer to continue using Databricks Runtime 11.3 LTS or below, use the following cluster-scoped init script to allow access to the custom endpoint. This script continuously ensures that outbound traffic to the specified port and CIDR is allowed from the cluster nodes.\nNote\nEnsure that the target endpoint falls within the Databricks VPC/VNet CIDR block.\n#!/bin/bash\r\ncat << 'EOF' > /tmp/set_rules.sh\r\n#!/bin/bash\r\nset -x\r\nsleep_interval=30s\r\n\r\nport=\"\"  ## Change this to your target port\r\ncidr=\"\"  ## Replace with your workspace VPC/VNet CIDR\r\n\r\nwhile true; do\r\n    rules=$(iptables -L | grep -i \"$port\")\r\n    if [[ \"$rules\" != *\"dpt:$port\"* && $(getent group spark-users) ]]; then\r\n        echo \"Changing rules at $(date)\" \r\n        iptables -I OUTPUT 2 -d $cidr -j ACCEPT -p tcp --dport $port\r\n    fi\r\n    sleep ${sleep_interval}\r\ndone\r\nEOF\r\n\r\nchmod a+x /tmp/set_rules.sh\r\n/tmp/set_rules.sh >> /tmp/set_rules.log & disown" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-edit-the-manage-principal-for-akv-backed-secret-scopes-in-the-databricks-ui-.json b/scraped_kb_articles/unable-to-edit-the-manage-principal-for-akv-backed-secret-scopes-in-the-databricks-ui-.json new file mode 100644 index 0000000000000000000000000000000000000000..826e2e0efb436fa8489387bf946951ab5b3bd5be --- /dev/null +++ b/scraped_kb_articles/unable-to-edit-the-manage-principal-for-akv-backed-secret-scopes-in-the-databricks-ui-.json @@ -0,0 +1,5 @@ +{ + "url": 
"https://kb.databricks.com/en_US/security/unable-to-edit-the-manage-principal-for-akv-backed-secret-scopes-in-the-databricks-ui-", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile creating an Azure Key Vault (AKV)-backed secret scope, you don’t set permissions for other users to access it. When you try to go back and change the setting to\nAll Users\nvia the UI, you encounter a\n“permission denied”\nerror.\nCause\nWhen creating AKV-backed secret scopes in Databricks, the 'Manage Principal' default is set to 'Creator', meaning only the creator and workspace admins can list and read the secrets from the scope.\nNon-admin users, even if they have the required permissions on the Azure side, cannot access the secrets due to the independent nature of Databricks' Access Control List (ACL) and AKV's access policies.\nSolution\nUse the Databricks CLI to edit the\nManage Principal\nvalue for the existing secret scope.\nThe following command can be used to grant read permissions to all users in the Databricks workspace. The\nprincipal\ncan be a user, a group, or a service principal. The permission parameters are all capitalized (\nREAD\n,\nWRITE\n,\nMANAGE\n).\ndatabricks secrets put-acl --scope  --principal users --permission READ\nFor more information, refer to the\nSecrets\ndocumentation.\nAlternatively, create a new secret scope with the desired\nManage Principal\nsetting.\nWhether you edit an existing secret scope or create a new one, consider creating a group with the desired users and granting the necessary permissions to the group. This simplifies permission management and ensures that all relevant users have access to the secret scope.\nFor more detail on managing secrets and ACLs via the Databricks API, refer to the\nSecret\nAPI documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-exclude-columns-from-a-table-based-on-specific-strings-in-the-comments.json b/scraped_kb_articles/unable-to-exclude-columns-from-a-table-based-on-specific-strings-in-the-comments.json new file mode 100644 index 0000000000000000000000000000000000000000..8835d5b4dfa64376b85141592d08a904920c76da --- /dev/null +++ b/scraped_kb_articles/unable-to-exclude-columns-from-a-table-based-on-specific-strings-in-the-comments.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/unable-to-exclude-columns-from-a-table-based-on-specific-strings-in-the-comments", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen working with tables in Databricks, you want to filter columns based on specific strings in their comments. For example, you might want to classify columns as “internal” and then pull all columns excluding those classified as “internal” into a BI tool using an SQL query.\nCause\nThere is not a direct SQL method to filter columns based on their comments. However, Databricks provides the\nDESCRIBE TABLE\ncommand, which you can leverage.\nSolution\nUse the following SQL code to retrieve column metadata, including comments, and then filter columns programmatically.\n%sql\r\nWITH column_info AS (\r\n DESCRIBE TABLE \r\n),\r\nfiltered_columns AS (\r\n SELECT col_name\r\n FROM column_info\r\n WHERE comment NOT LIKE '%internal%'\r\n)\r\nSELECT\r\n (SELECT STRING_AGG(col_name, ', ') FROM filtered_columns)\r\nFROM ;\nTo verify the solution works, after executing the constructed SQL query check that the expected columns are returned without the ones containing the specific string in their comments." 
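The filtering step of the SQL above can be sketched in plain Python, using made-up `DESCRIBE TABLE`-style rows (all column names and comments below are hypothetical, just to show the `comment NOT LIKE '%internal%'` logic):

```python
# Hypothetical (col_name, data_type, comment) rows, as DESCRIBE TABLE would return.
describe_rows = [
    ("customer_id", "bigint", "primary key"),
    ("ssn", "string", "internal - do not expose"),
    ("email", "string", "contact address"),
    ("audit_ts", "timestamp", "internal bookkeeping"),
]

def select_list(rows, excluded_marker="internal"):
    """Keep columns whose comment does not contain the marker,
    mirroring WHERE comment NOT LIKE '%internal%'."""
    kept = [name for name, _, comment in rows
            if excluded_marker not in (comment or "")]
    return ", ".join(kept)

print(select_list(describe_rows))  # customer_id, email
```

The resulting comma-separated string is what the `STRING_AGG` subquery produces, ready to splice into the projection of a BI query.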
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-get-apache-spark-sparkenv-settings-via-pyspark.json b/scraped_kb_articles/unable-to-get-apache-spark-sparkenv-settings-via-pyspark.json new file mode 100644 index 0000000000000000000000000000000000000000..9d7cd3c7f873d1b293deaddb82f543cc02525357 --- /dev/null +++ b/scraped_kb_articles/unable-to-get-apache-spark-sparkenv-settings-via-pyspark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/unable-to-get-apache-spark-sparkenv-settings-via-pyspark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re able to get Apache Spark settings using\nSparkEnv.get.conf.get()\nin Scala, but want to use the PySpark equivalent instead.\nScala example\nimport org.apache.spark.SparkEnv\r\n\r\nval res = spark.range(1).rdd.map(_ => SparkEnv.get.conf.get(\"test\", \"default\")).collect()\nCause\nPySpark doesn't provide a direct equivalent to Scala's\nSparkEnv.get.conf.get()\nthat can be safely used on executors. 
This is due to the differences in how Scala and Python interact with the JVM in Spark.\nSolution\nUse the following steps to obtain the same output using PySpark.\nRetrieve the value of the configuration parameter\n\"test\"\nfrom the SparkConf object.\nBroadcast\ntest_value\nto all worker nodes in the Spark cluster.\nApply the map transformation that replaces each element with the value of the broadcast variable.\nExample code\n# Get the value of \"test\" from SparkConf, or use \"default\" if not set\r\ntest_value = sc.getConf().get(\"test\", \"default\")\r\n\r\n# Broadcast the test_value to all worker nodes to perform map operation later\r\nbroadcast_test_value = sc.broadcast(test_value)\r\n\r\n# Create an RDD with a single element, transform it, and collect the result\r\nres = spark.range(1).rdd.map(lambda _: broadcast_test_value.value).collect()\r\n\r\n# Print the result\r\nprint(res)" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-import-from-dbc-file-invalid_parameter_value-error.json b/scraped_kb_articles/unable-to-import-from-dbc-file-invalid_parameter_value-error.json new file mode 100644 index 0000000000000000000000000000000000000000..1afc1473c97d330b4f488331f292d491c392bcc6 --- /dev/null +++ b/scraped_kb_articles/unable-to-import-from-dbc-file-invalid_parameter_value-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/unable-to-import-from-dbc-file-invalid_parameter_value-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou previously exported a DBC file using the Databricks REST API (\nAWS\n|\nAzure\n|\nGCP\n). 
When you try to import that same DBC file into your Databricks workspace you get an invalid parameter value error message.\nINVALID_PARAMETER_VALUE: The dbc file may not be valid or may be an unsupported version.\nInfo\nThis can occur when using either the\nDBC\nor\nAuto\nformat for when importing and exporting files or notebooks.\nCause\nYou are missing the\ndirect_download\nparameter when using the API to export. By default, the export API returns a JSON object with base64-encoded content. This is not valid for DBC files. When the\ndirect_download\nparameter is set to\ntrue\n, the API call returns the raw DBC file, without encoding it.\nSolution\nYou should always use\n\"direct_download\": \"true\"\nwhen exporting files or notebooks via the API.\nExample code\n%sh\r\ncurl -X POST \\\r\n  https:///api/2.0/workspace/export \\\r\n  -H \"Authorization: Bearer \" \\\r\n  -H \"Content-Type: application/json\" \\\r\n  -d '{\r\n    \"path\": \"\",\r\n    \"format\": \"AUTO\",\r\n    \"direct_download\": \"true\"\r\n  }'\nRe-export the DBC file using the updated API call.\nImport the newly exported DBC file to verify it is working correctly.\nInfo\nTo avoid similar issues in the future, always ensure the\ndirect_download\nparameter is set to\ntrue\nwhen exporting files using the API." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-infer-schema-for-orc-error.json b/scraped_kb_articles/unable-to-infer-schema-for-orc-error.json new file mode 100644 index 0000000000000000000000000000000000000000..a8126e74b14a32640cb1d97f5df1e676e05aa016 --- /dev/null +++ b/scraped_kb_articles/unable-to-infer-schema-for-orc-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/unable-to-infer-schema-for-orc-error", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are trying to read ORC files from a directory when you get an error message:\norg.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.\nCause\nAn\nUnable to infer the schema for ORC\nerror occurs when the schema is not defined and Apache Spark cannot infer the schema due to:\nAn empty directory.\nUsing the base path instead of the complete path to the files when there are multiple subfolders containing ORC files.\nEmpty directory example\nCreate an empty directory\n/tmp/testorc_empty\n.\n%sh mkdir /dbfs/tmp/testorc_empty\nAttempt to read the directory.\nval df = spark.read.orc(\"dbfs:/tmp/testorc_empty\")\nThe read fails with an\nUnable to infer the schema for ORC\nerror.\nBase path example\nWhen only the base path is given (instead of the complete path) and there are multiple subfolders containing ORC files, a read attempt returns the error:\nUnable to infer the schema for ORC\n.\nCreate multiple folders under\n/tmp/testorc\n.\nimport org.apache.hadoop.fs.Path\r\nval basePath = \"dbfs:/tmp/testorc\"\r\nspark.range(1).toDF(\"a\").write.orc(new Path(basePath, \"first\").toString)\r\nspark.range(1,2).toDF(\"a\").write.orc(new Path(basePath, \"second\").toString)\r\nspark.range(2,3).toDF(\"a\").write.orc(new Path(basePath, \"third\").toString)\nAttempt to read the directory\n/tmp/testorc\n.\nval df = spark.read.orc(basePath)\nThe read fails with an\nUnable to infer the schema for ORC\nerror.\nSolution\nEmpty directory solution\nCreate an empty directory\n/tmp/testorc_empty\n.\n%sh mkdir /dbfs/tmp/testorc_empty\nInclude the schema when you attempt to read the directory.\nval df_schema = spark.read.schema(\"a int\").orc(\"dbfs:/tmp/testorc_empty\")\nThe read attempt does not return an error.\nBase path solution\nCreate multiple folders under\n/tmp/testorc\n.\nimport org.apache.hadoop.fs.Path\r\nval basePath = \"dbfs:/tmp/testorc\"\r\nspark.range(1).toDF(\"a\").write.orc(new Path(basePath, \"first1\").toString)\r\nspark.range(1,2).toDF(\"a\").write.orc(new Path(basePath, \"second2\").toString)\r\nspark.range(2,3).toDF(\"a\").write.orc(new Path(basePath, 
\"third3\").toString)\nInclude the schema and a full path to one of the subfolders when you attempt to read the directory. In this example, we are using the path to the folder\n/third3/\n.\nval dfWithSchema = spark.read.schema(\"a long\").orc(basePath + \"/third3/\")\nThe read attempt does not return an error." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-install-r-package-survminer-on-cluster.json b/scraped_kb_articles/unable-to-install-r-package-survminer-on-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..3781a3a2c71dabc7bdbb3fd09e68eab4ca1681b9 --- /dev/null +++ b/scraped_kb_articles/unable-to-install-r-package-survminer-on-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/unable-to-install-r-package-survminer-on-cluster", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen trying to use CRAN to install the R package\nSurvminer\non a Databricks cluster using the Libraries section, you encounter an issue and receive an error message.\n'installation of package ‘’ had non-zero exit status'\nCause\nThe\nSurvminer\npackage is not available for R version 4.3.2 used in the Databricks cluster. Attempts to install the package and its dependencies directly from CRAN or GitHub fail due to compatibility issues with the R version. 
Additionally, network restrictions or CRAN mirror issues may prevent you from installing the package.\nSolution\nUse an init script to install a compatible version of R and the necessary dependencies, such as the following example.\n#!/bin/bash\r\n\r\n# Update the package list\r\nsudo apt-get update\r\n\r\n# Install R\r\nsudo apt-get install -y r-base\r\n\r\n# Install necessary dependencies for R packages\r\nsudo apt-get install -y libcurl4-openssl-dev libxml2-dev libssl-dev\r\n\r\n# Install remotes package to enable installation of specific versions\r\nsudo R -e \"install.packages('remotes', repos = 'http://cran.us.r-project.org')\"\r\n\r\n# Install specific version of Matrix package\r\nsudo R -e \"remotes::install_version('Matrix', version = '1.6-2', repos = 'http://cran.us.r-project.org')\"\r\n\r\n# Install Survminer package\r\nsudo R -e \"install.packages('survminer', repos = 'http://cran.us.r-project.org')\"\nAttach the init script to your cluster.\nRestart the cluster.\nIf the issue persists, check for network restrictions or firewall rules that may be blocking access to the CRAN repository. 
Ensure that the proxy settings are correctly configured if your network uses a proxy.\nIf needed, you can try installing the package from a known CRAN mirror directly in the notebook using the following command.\ninstall.packages('survminer', repos = 'http://cran.us.r-project.org')" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-manage-pipeline-permissions-despite-being-pipeline-owner-or-workspace-admin.json b/scraped_kb_articles/unable-to-manage-pipeline-permissions-despite-being-pipeline-owner-or-workspace-admin.json new file mode 100644 index 0000000000000000000000000000000000000000..3563ee70283fce320c84b2d8962ffcf04383ad29 --- /dev/null +++ b/scraped_kb_articles/unable-to-manage-pipeline-permissions-despite-being-pipeline-owner-or-workspace-admin.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/unable-to-manage-pipeline-permissions-despite-being-pipeline-owner-or-workspace-admin", + "title": "Título do Artigo Desconhecido", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-map-a-databricks-secret-scope-to-azure-key-vault-with-a-databricks-managed-service-principal.json b/scraped_kb_articles/unable-to-map-a-databricks-secret-scope-to-azure-key-vault-with-a-databricks-managed-service-principal.json new file mode 100644 index 0000000000000000000000000000000000000000..dd2a99217bd81c06d5b16a74740efced3a629ab7 --- /dev/null +++ b/scraped_kb_articles/unable-to-map-a-databricks-secret-scope-to-azure-key-vault-with-a-databricks-managed-service-principal.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/terraform/unable-to-map-a-databricks-secret-scope-to-azure-key-vault-with-a-databricks-managed-service-principal", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you try to map a Databricks secret scope to Azure Key Vault using a Databricks-managed service principal in Terraform, you receive the following error.\nError Msg:\r\n│ 
Error: cannot create secret scope: Scope with Azure KeyVault must have userAADToken defined!\nCause\nThe Databricks-managed service principal cannot access the Azure Key Vault.\nSolution\nUse Terraform to create the secret scope in Azure Key Vault directly instead.\nCreate an Azure-based service principal using Terraform. Follow the steps provided in the Terraform\ndatabricks_service_principal Resource\ndocumentation. The following code is an example.\nresource \"databricks_service_principal\" \"sp\" {\r\n\r\napplication_id = \"\"\r\n\r\ndisplay_name = \"\"\r\n\r\nallow_cluster_create = true\r\n\r\n}\nOnce you have created the service principal, use it to authenticate with Databricks. For detailed steps on how to authenticate access to Databricks resources, refer to the Databricks\nAuthorizing access to Databricks resources\ndocumentation.\nUse the Azure-managed service principal for authentication you created in step 1 to create an Azure Key Vault-backed secret scope in Databricks. For instructions, refer to the\nCreate an Azure Key Vault-backed secret scope\nsection of the\nSecret management\ndocumentation.\nSupporting details\nAzure Key Vault-backed secret scopes are read-only interfaces to the Azure Key Vault. You must use an Azure-managed service principal to create and manage these secret scopes.\nPersonal Access Tokens (PATs) cannot be used to create Azure Key Vault-backed secret scopes. You need to use one of the Azure-specific authentication methods.\nIf you encounter errors while creating the secret scope using Terraform, ensure that the service principal has the necessary permissions on the Azure Key Vault. For more information, review the requirements in the\nSecret management\ndocumentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-parallelize-the-code-using-the-apply-api-from-pandas-on-pyspark.json b/scraped_kb_articles/unable-to-parallelize-the-code-using-the-apply-api-from-pandas-on-pyspark.json new file mode 100644 index 0000000000000000000000000000000000000000..85457967aaaea4a143ad719701a88ba3ae7c37b9 --- /dev/null +++ b/scraped_kb_articles/unable-to-parallelize-the-code-using-the-apply-api-from-pandas-on-pyspark.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/unable-to-parallelize-the-code-using-the-apply-api-from-pandas-on-pyspark", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen applying computation on a column in Pandas on PySpark Dataframe, the execution does not happen in parallel.\nExample\nIn the following code, a Pandas library is imported from Pyspark and a Pyspark Pandas DF named\ndf\nis defined. A function\ntemp_function\nis also included, to apply to the\n'col3'\ncolumn using the\napply\nmethod. The function also includes a type hint.\n%python\r\n\r\n# This code runs on driver\r\n\r\nimport time\r\nfrom pyspark import pandas as ps\r\nimport numpy as np\r\n\r\ndf2 = ps.DataFrame(np.random.choice([1,2,3], size=(3,3))).rename(columns={0: 'col1', 1: 'col2', 2:'col3'})\r\nprint(type(df2))\r\n\r\ndef temp_function(x) -> int:\r\n    print(\"Hello\")\r\n    time.sleep(5)\r\n    print(\"Func_end\")\r\n    return x\r\n\r\ndf2['col3'] = df2['col1'].apply(lambda x: temp_function(x))\nHowever, this code does not achieve parallel execution as expected. Instead, all operations are confined to the driver.\nIn the following image, all print statements from\ntemp_function\nappear directly in the console, rather than in the executor logs. 
Moreover, the execution is eager, with an Apache Spark job being triggered twice: first when the function is executed and again when the DataFrame is printed.\nCause\nThe lack of parallel execution in the code stems from the use of the\napply\nmethod with a Python lambda function, which inherently does not have a return type. Although\ntemp_function\nis defined with a type hint, the lambda wrapper used in the\napply\nmethod does not relay this information to PySpark's execution engine.\nConsequently, PySpark does not interpret or distribute the execution across the cluster. Instead, it operates only on the driver node, treating the function as a Python function rather than a distributed operation, leading to the eager and non-parallel execution observed.\nSolution\nTo ensure parallel execution of operations on a PySpark DataFrame, directly use the\napply\nfunction from\npyspark.pandas\nwithout wrapping it in a lambda function.\nThis approach uses PySpark's capability to distribute tasks across the cluster's executors. By specifying the return type directly in the function definition (\ntemp_function\n), the return type is read and not inferred, which reduces the execution overhead.\nimport time \r\nimport numpy as np \r\nfrom pyspark import pandas as ps\r\n\r\ndf = ps.DataFrame(np.random.choice([1, 2, 3], size=(3, 3))).rename(columns={0: 'col1', 1: 'col2', 2: 'col3'}) \r\nprint(type(df)) \r\n\r\ndef temp_function(x) -> int:\r\n    print(\"Hello\")\r\n    time.sleep(10)\r\n    print(\"Func_end\")\r\n    return x\r\n\r\ndf['col3'] = df['col1'].apply(temp_function)\nIn the following image, the print statements within\ntemp_function\nare now logged in the executor logs instead of the console, confirming the distributed nature of the task execution." 
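The cause described above — the lambda wrapper discarding the return-type hint — can be observed with plain Python introspection, no Spark cluster required (a small standalone sketch):

```python
def temp_function(x) -> int:
    return x

# The kind of wrapper passed via apply(lambda x: temp_function(x)).
wrapped = lambda x: temp_function(x)

# The named function carries a usable return-type annotation...
print(temp_function.__annotations__)  # {'return': <class 'int'>}
# ...but the lambda exposes no annotations at all, so an engine
# inspecting the callable cannot see the return type.
print(wrapped.__annotations__)        # {}
```

This is why passing `temp_function` directly lets pyspark.pandas read the return type, while the lambda forces inference on the driver.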
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-pass-a-param-string-value-of-more-than-65535-characters-in-a-workflow-that-uses-a-jar-in-a-job.json b/scraped_kb_articles/unable-to-pass-a-param-string-value-of-more-than-65535-characters-in-a-workflow-that-uses-a-jar-in-a-job.json new file mode 100644 index 0000000000000000000000000000000000000000..5a05e74fbeb942c2f714b35f27ab4813e5aaff39 --- /dev/null +++ b/scraped_kb_articles/unable-to-pass-a-param-string-value-of-more-than-65535-characters-in-a-workflow-that-uses-a-jar-in-a-job.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/unable-to-pass-a-param-string-value-of-more-than-65535-characters-in-a-workflow-that-uses-a-jar-in-a-job", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile creating a workflow using a JAR in your job, you notice you are unable to pass a param string value of more than 65,535 characters. You receive a message\nerror: Error while emitting $iw UTF8 string too large\n.\nCause\nThe Scala compiler embedded in the Scala REPL is designed to accept job parameters up to 65,535 characters. When the length of the parameters in the job param exceeds this limit, the compiler throws an error, causing the job to fail.\nSolution\nPass the param using a\n.txt\nfile in\nWorkspace FileSystem (WSFS).\n1. Create a file in WSFS.\nNavigate to your desired location in the Databricks workspace.\nCreate a new file, such as\nparams.txt\n.\nAdd parameters to the file, such as\nparam1 = value1\n.\nSave the file.\n2. Get the WSFS file path.\nRight-click the file and select\nCopy path\n.\nThe path will look like\n/Workspace/Users/your-username/params.txt\n.\n3. 
Modify your JAR to read parameters from a file.\nUpdate your main class to accept a file path as an argument.\nImplement logic to read and parse the file contents.\nExample\npackage com.example.demo;\r\nimport java.io.BufferedReader;\r\nimport java.io.File;\r\nimport java.io.FileReader;\r\nimport java.io.IOException;\r\npublic class DemoApplicationFileReader {\r\n   public static void main(String[] args) {\r\n       if (args.length == 0) {\r\n           System.out.println(\"Please provide a file path as an argument.\");\r\n           return;\r\n       }\r\n       String filePath = args[0];\r\n       File file = new File(filePath);\r\n       if (!file.exists()) {\r\n           System.out.println(\"Error: The file at '\" + filePath + \"' does not exist.\");\r\n           return;\r\n       }\r\n       if (!file.isFile()) {\r\n           System.out.println(\"Error: '\" + filePath + \"' is not a regular file.\");\r\n           return;\r\n       }\r\n       try (BufferedReader reader = new BufferedReader(new FileReader(file))) {\r\n           String line;\r\n           System.out.println(\"Contents of file: \" + file.getAbsolutePath());\r\n           System.out.println(\"-----------------------------\");\r\n           while ((line = reader.readLine()) != null) {\r\n               System.out.println(line);\r\n           }\r\n       } catch (IOException e) {\r\n           System.out.println(\"An error occurred while reading the file:\");\r\n           e.printStackTrace();\r\n       }\r\n   }\r\n}\n4. Create a Databricks job.\nGo to\nWorkflows\nand create a new job.\nSet the job type to\nJAR\n.\nSpecify the DBFS path to your JAR.\nIn the\nParameters\nfield, enter the WSFS path to your parameter file.\n5. Save the job configuration and run it\n." 
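The article's Java example only prints the file; in a real job the JAR would typically parse the `param1 = value1` lines into a map. A sketch of that parsing step in Python (the comment-skipping rule is an added assumption, not part of the article's format):

```python
def parse_params(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # only keep lines that actually contain '='
            params[key.strip()] = value.strip()
    return params

sample = "param1 = value1\n# a comment\nparam2 = 12345"
print(parse_params(sample))  # {'param1': 'value1', 'param2': '12345'}
```

Keeping the parameters in a file this way sidesteps the 65,535-character limit entirely, since only the short file path is passed as the job param.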
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-push-github-actions-workflows-from-a-databricks-repository.json b/scraped_kb_articles/unable-to-push-github-actions-workflows-from-a-databricks-repository.json new file mode 100644 index 0000000000000000000000000000000000000000..5135cb64b11b597bbd4aa1f0bf388d84a360a452 --- /dev/null +++ b/scraped_kb_articles/unable-to-push-github-actions-workflows-from-a-databricks-repository.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/unable-to-push-github-actions-workflows-from-a-databricks-repository", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile attempting to push a GitHub Actions workflow file to your repository's\n.github/workflows/\nfolder using a Git folder within your Databricks environment, you encounter the following error.\nError pushing changes: Remote ref update was rejected. Make sure you have write access to this remote repository.\nThis issue is specific to GitHub Actions workflow files and does not affect other types of files.\nCause\nYou're using the Databricks GitHub App to connect to your remote repository. While this method supports standard Git operations, it does not have permission to modify GitHub Actions workflow files in the\n.github/workflows/\ndirectory.\nSolution\nTo push changes to workflow files, you'll need to use a Github Personal Access Token (PAT) with\nrepo\nand\nworkflow\nscopes for a repository using GitHub Actions. (Integration with GitHub Action workflows only supports PAT, not OAuth.)\nSteps to generate and configure a Github PAT:\nIn the upper-right corner of any GitHub page, click your profile photo and select\nSettings\n.\nNavigate to\nDeveloper settings\n.\nClick\nPersonal access tokens\n, then select\nTokens (fine-grained tokens)\n.\nClick\nGenerate new token\n.\nProvide a token name description. 
Set the\nResource owner\nand\nRepository access\nfields accordingly.\nUnder\nPermissions\n, change the workflow scope to\nRead and write access\n. Set additional permissions accordingly.\nClick\nGenerate token\n.\nCopy the Github token and enter it in your Databricks workspace by navigating to\nUser Settings > Linked Accounts > Git Integration > Personal access token\n.\nFor more information, review the\nConfigure Git credentials & connect a remote repo to Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-read-delta-table-with-deletion-vectors.json b/scraped_kb_articles/unable-to-read-delta-table-with-deletion-vectors.json new file mode 100644 index 0000000000000000000000000000000000000000..4eefba4bc514e60088791cf8c733133d2160fb96 --- /dev/null +++ b/scraped_kb_articles/unable-to-read-delta-table-with-deletion-vectors.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/unable-to-read-delta-table-with-deletion-vectors", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou receive an error when trying to read a Delta table.\njava.lang.RuntimeException: Unable to read this table because it requires reader table feature(s) that is unsupported by this version of Databricks: deletionVectors.\nCause\nDelta tables with deletion vectors enabled can only be queried using clusters with Databricks Runtime 12.2 LTS - 15.3 (current).\nNote\nIf you create a new Delta table using the SQL warehouse (which has Databricks Runtime 14.3 LTS - 15.3 (current)), the table will be created with deletion vectors and hold the corresponding properties and policies.\nSolution\nUse the cluster with Databricks Runtime 12.2 LTS - 15.3 (current) to query all deletion-vector-enabled Delta tables.\nFor more information on deletion vectors, including compatibility with Delta clients, please review the\nWhat are deletion vectors?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nIf you are 
unable to change your Databricks Runtime version, you can temporarily disable the\nAuto-Enable Deletion Vectors\nin your workspace settings.\nImportant\nDatabricks recommends using the deletion vector feature as it helps reduce transaction time for\nDELETE/UPDATE\nprocessing in Delta tables." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-read-external-hive-metastore-tables-on-sql-warehouses.json b/scraped_kb_articles/unable-to-read-external-hive-metastore-tables-on-sql-warehouses.json new file mode 100644 index 0000000000000000000000000000000000000000..ee4e6cf69a218e53285ac3a34dcec87296f975fe --- /dev/null +++ b/scraped_kb_articles/unable-to-read-external-hive-metastore-tables-on-sql-warehouses.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/unable-to-read-external-hive-metastore-tables-on-sql-warehouses", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen using a Databricks SQL warehouse to run queries against tables on an external Hive metastore (HMS), you receive an error.\n[TABLE_OR_VIEW_NOT_FOUND] The table or view `` cannot be found.\nCause\nA global init script is configured in Databricks to set the necessary Apache Spark configurations for an external HMS, but SQL warehouses do not execute global init scripts.\nSolution\nManually set the appropriate Spark configurations in your admin settings for SQL warehouses to enable your warehouse to access an external HMS table.\nFor details, review the\nEnable data access configuration\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
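The data access configuration field accepts one `key value` pair per line. As an illustrative sketch only — every value below is a placeholder, and the exact keys depend on your metastore version and JDBC driver, so confirm them against the data access configuration documentation:

```python
# Placeholder external-HMS settings of the kind entered in the SQL warehouse
# admin settings; these do NOT run as an init script.
data_access_config = {
    "spark.sql.hive.metastore.version": "2.3.9",
    "spark.sql.hive.metastore.jars": "builtin",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://<host>:3306/<metastore-db>",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "<user>",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "{{secrets/<scope>/<key>}}",
}

# Render as the key-value lines pasted into the data access configuration field.
for key, value in data_access_config.items():
    print(f"{key} {value}")
```

Referencing the password through a `{{secrets/...}}` placeholder keeps credentials out of the plain-text admin settings.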
+} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-run-interactive-workloads-using-a-dedicated-formerly-single-user-compute-assigned-to-a-service-principal-where-you-have-the-service-principal-user-role.json b/scraped_kb_articles/unable-to-run-interactive-workloads-using-a-dedicated-formerly-single-user-compute-assigned-to-a-service-principal-where-you-have-the-service-principal-user-role.json new file mode 100644 index 0000000000000000000000000000000000000000..4019d0ed82698cda5758abd16ef1900959e9c1ab --- /dev/null +++ b/scraped_kb_articles/unable-to-run-interactive-workloads-using-a-dedicated-formerly-single-user-compute-assigned-to-a-service-principal-where-you-have-the-service-principal-user-role.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unable-to-run-interactive-workloads-using-a-dedicated-formerly-single-user-compute-assigned-to-a-service-principal-where-you-have-the-service-principal-user-role", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou’re working in an all-purpose compute with dedicated (formerly single user) access mode. The access mode is assigned to a service principal where you have the Service Principal User role. When you attempt to run an interactive workload using this compute, you receive the following error message.\nSingle-user check failed: user '' attempted to run a command on single-user cluster , but the single user of this cluster is ''\nCause\nDedicated access mode assigned to a service principal can only be used for workflows (jobs, tasks, or pipelines) set up to run as the service principal.\nDedicated access mode does not support interactive workloads when assigned to a service principal.\nSolution\nImportant\nDatabricks no longer supports using an all-purpose compute with dedicated (formerly single user) access mode assigned to a service principal.\nDatabricks recommends migrating your all-purpose compute to standard (formerly shared) access mode. 
Alternatively, you can migrate to dedicated access mode assigned to a user or group instead. The choice depends on your workload use case requirements. Refer to the\nCompute access mode limitations for Unity Catalog\n(\nAWS\n|\nAzure\n|\nGCP\n)  documentation for more information.\nNote\nFor cases where you need a cluster to be available quickly for a job, Databricks recommends using serverless compute instead. For more information, refer to the\nRun your Databricks job with serverless compute for workflows\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nWhen you need to use an all-purpose compute assigned to a service principal:\nCreate a job workflow with\nJob details > Run as\nset to this service principal.\nUse the API or Python SDK to set the task’s\nCompute*\nto the all-purpose compute assigned to the service principal. This assigns the compute to the task.\nThe following sections provide example code for using the API and the SDK respectively.\nFor more information, refer to the\nOrchestration using Databricks Jobs\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nAPI\ncurl -X POST https://.cloud.databricks.com/api/2.2/jobs/update \\\r\n-H 'Authorization: Bearer ' \\\r\n-H \"Content-Type: application/json\" \\\r\n-d '{\r\n      \"job_id\": \"\",\r\n      \"new_settings\": {\r\n        \"tasks\": [\r\n          {\r\n            \"task_key\": \"\",\r\n            \"existing_cluster_id\": \"\"\r\n          }\r\n        ]\r\n      }\r\n    }'\nPython SDK\nfrom databricks.sdk import WorkspaceClient\r\nfrom databricks.sdk.service import jobs\r\n\r\n# Create a workspace client\r\nw = WorkspaceClient()\r\n\r\n# Define the job ID to update\r\njob_id = \"\"\r\n\r\n# Define the updated settings for the job, including the cluster settings\r\nupdated_settings = jobs.JobSettings(\r\n    tasks=[\r\n        jobs.Task(\r\n            task_key=\"\",\r\n            existing_cluster_id=\"\",  # Provide the cluster ID to use\r\n        )\r\n    ]\r\n)\r\n\r\n# Update the job with the 
new settings\r\nw.jobs.update(job_id=job_id, new_settings=updated_settings)" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-select-a-single-node-cluster-when-using-the-default-job-compute-policy.json b/scraped_kb_articles/unable-to-select-a-single-node-cluster-when-using-the-default-job-compute-policy.json new file mode 100644 index 0000000000000000000000000000000000000000..5259146ae189de3a7750cb4033012e7728511a19 --- /dev/null +++ b/scraped_kb_articles/unable-to-select-a-single-node-cluster-when-using-the-default-job-compute-policy.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/unable-to-select-a-single-node-cluster-when-using-the-default-job-compute-policy", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile setting up a new job cluster, you indicate the default policy,\nJob Compute\n, and\nSingle user\nunder\nAccess mode\n. You notice a tooltip in the UI saying you can select “single-node” from the compute mode while your indicated number of workers is set to\n1\nbut cannot see how to select the single node cluster in the UI.\nThe following screenshot shows the UI with the tooltip after selecting 1 worker.\nCause\nThe\nJob Compute\ndefault policy is designed to create clusters with multiple workers.\nSolution\nCreate a custom job policy that instructs Databricks to create a single-node cluster. Use the following configuration as a minimum. 
For details on creating a custom job policy, refer to the “Use policy families to create custom policies” section of the\nDefault policies and policy families\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n{\r\n  \"cluster_type\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"job\"\r\n  },\r\n  \"num_workers\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": 0\r\n  },\r\n  \"spark_conf.spark.databricks.cluster.profile\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"singleNode\"\r\n  },\r\n  \"spark_conf.spark.master\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"local[*,4]\"\r\n  },\r\n  \"custom_tags.ResourceClass\": {\r\n    \"type\": \"fixed\",\r\n    \"value\": \"SingleNode\"\r\n  }\r\n}\nThen, when you return to the job cluster setup UI view, you can indicate your custom policy, and see\nSingle node\nselected as well." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-update-a-serving-endpoint-even-with-necessary-permissions.json b/scraped_kb_articles/unable-to-update-a-serving-endpoint-even-with-necessary-permissions.json new file mode 100644 index 0000000000000000000000000000000000000000..2000f054ffcb85770de0314fcb205b9fe8481903 --- /dev/null +++ b/scraped_kb_articles/unable-to-update-a-serving-endpoint-even-with-necessary-permissions.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/unable-to-update-a-serving-endpoint-even-with-necessary-permissions", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are unable to update a serving endpoint in Databricks, even if you have the necessary permissions. The error message observed is:\n'\nUser does not have permission 'View' on Model '\n.\nThis issue occurs in the Machine Learning environment when working with serving endpoints and registered models.\nCause\nThe user who created the registered model-serving endpoint no longer has permissions. 
Databricks uses a creator’s credentials to determine downstream permissions.\nImportant!\nWhen the creator no longer has permissions, neither other users, admins, nor Databricks support will be able to update the endpoint.\nSolution\nIf the user still exists in the workspace:\nRe-add the necessary permissions for the user who created the registered model-serving endpoint.\nTrigger the update for the model-serving endpoint once the permissions are restored.\nVerify that the model-serving endpoint is updated to use the latest model version as required.\nIf the user is entirely removed from the Account/workspace, please contact Databricks support.\nNote\nFor production model-serving endpoints, we recommend using a service principal to create model-serving endpoints and upstream assets for consistent update evaluation. As an additional measure, consider registering the model in Unity Catalog (UC) and migrating the model." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-update-group-push-mapping-target-error-when-syncing-okta-groups-to-workspace.json b/scraped_kb_articles/unable-to-update-group-push-mapping-target-error-when-syncing-okta-groups-to-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..3c09bc85a31f607fa6b087aac0f7e5078b4d4db5 --- /dev/null +++ b/scraped_kb_articles/unable-to-update-group-push-mapping-target-error-when-syncing-okta-groups-to-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/unable-to-update-group-push-mapping-target-error-when-syncing-okta-groups-to-workspace", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou\nenable identity federation\n(\nAWS\n|\nAzure\n) on your Databricks workspace.\nYou are trying to push Okta groups to your workspace via SCIM, but get an\nUnable to update Group Push mapping target\nerror.\nFailed on 07-13-2022 07:10:58PM UTC: Unable to update Group Push mapping target App group\r\ndatabricks_prod_admins: 
Error while creating user group databricks_prod_admins: Bad Request. Errors\r\nreported by remote server: Request is unparsable, syntactically incorrect, or violates schema.\nCause\nWhen identity federation is enabled in your workspace, you cannot SCIM sync users and groups directly to the workspace. Users and groups should be centrally managed in the account-level SCIM application.\nSolution\nYou should\ndisable workspace SCIM sync in the Okta application.\nNavigate to your per-workspace application in the\nOkta configuration settings\n.\nClick the workspace application name (for example,\nDatabricks Workspace Level SCIM Application\n).\nClick\nProvisioning\n.\nIn the Settings drop-down menu, click\nConfigure API Integration\n.\nClick\nEdit\n.\nRemove the check mark for\nEnable API Integration\n.\nClick\nSave\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-use-dynamic-variable-passing-to-create-a-function-in-databricks-sql-warehouse.json b/scraped_kb_articles/unable-to-use-dynamic-variable-passing-to-create-a-function-in-databricks-sql-warehouse.json new file mode 100644 index 0000000000000000000000000000000000000000..01b83e102a0a1aa67334dcf1d8711c33233e2026 --- /dev/null +++ b/scraped_kb_articles/unable-to-use-dynamic-variable-passing-to-create-a-function-in-databricks-sql-warehouse.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/unable-to-use-dynamic-variable-passing-to-create-a-function-in-databricks-sql-warehouse", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to pass a dynamic variable to a Python user-defined function (UDF) and get a list of the table versioning that is required for operations\nCREATE\n,\nMERGE\n,\nINSERT\n,\nDELETE\n,\nUPSERT\n,\nUPDATE\n,\nWRITE\n,\nREPLACE\n, and\nCLONE\nwithin the DBSQL warehouse, but you are unable to create a function unless you use a notebook and restrict running your SQL queries to the SQL editor.\nCause\nDatabricks SQL Warehouse 
does not allow dynamic variable passing within SQL to create\nfunctions\n. (This is distinct from executing\nqueries\nby dynamically passing variables.)\nSolution\nUse a Python UDF in a notebook to dynamically pass the table name as a variable, then access the function in a notebook or DBSQL.\nExample\nFirst retrieve the version number of the most recent change (excluding the latest change) in the history where certain operations (like\nCREATE\n,\nMERGE\n,\nINSERT\n,\nDELETE\n,\netc.) were performed.\nGiven the Delta table, the retrieval returns the version along with the table name, e.g.\ntable_name@v2\n.\ndef sayHello(*args):\r\n    query = \"\"\"\r\n        CREATE OR REPLACE FUNCTION .. (table-name STRING)\r\n        RETURNS STRING\r\n        READS SQL DATA\r\n        RETURN (\r\n          WITH hist AS (\r\n            SELECT *\r\n            FROM (DESCRIBE HISTORY {0})\r\n          ),\r\n          last_version AS (\r\n            SELECT version\r\n            FROM hist\r\n            WHERE operation IN ('CREATE', 'MERGE', 'INSERT', 'DELETE', 'UPSERT', 'UPDATE', 'WRITE', 'REPLACE', 'CLONE')\r\n            ORDER BY version DESC\r\n            LIMIT 1\r\n          )\r\n          SELECT CONCAT('{0}', '@v', CAST(a.version AS STRING))\r\n          FROM hist a, last_version b \r\n          WHERE a.version < b.version\r\n          ORDER BY a.version DESC\r\n          LIMIT 1\r\n        )\r\n        \"\"\"\r\n    df = spark.sql(query.format(*args))\r\n    df.show(truncate=False)\r\n    return\nThen call the function using DBSQL.\nselect .. 
()" +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-use-fields-with-qualifiers-in-the-dlt-apply-changes-api.json b/scraped_kb_articles/unable-to-use-fields-with-qualifiers-in-the-dlt-apply-changes-api.json new file mode 100644 index 0000000000000000000000000000000000000000..49ff441ace80c16c22c723810bfb2e5c1d9cf2f1 --- /dev/null +++ b/scraped_kb_articles/unable-to-use-fields-with-qualifiers-in-the-dlt-apply-changes-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/unable-to-use-fields-with-qualifiers-in-the-dlt-apply-changes-api", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nDelta Live Tables (DLT) provides an\nAPPLY CHANGES\nAPI that helps simplify change data capture (CDC) use cases. When you try to reference a field with a qualifier (this may also be referred to as a 'nested field') using DLT\nAPPLY CHANGES\nAPI arguments such as\nkeys\nor\nsequence_by\n, you encounter an\nAnalysisException\n.\nExample\ndlt.apply_changes(\r\n  target = \"\",\r\n  source = \"\",\r\n  keys = [\".\"],\r\n...\r\n)\r\nAnalysisException: The Column value must be a column identifier without any qualifier.\nCause\nUsing a qualifier such as\nkey1.key2\nis not supported with required fields such as\nkeys\nand\nsequence_by\n.\nSolution\nAdd a view with\n@dlt.view\nto extract the desired columns using a Spark API such as\nselect\n, or\nwithColumn[s]\nprior to referencing them in\nAPPLY CHANGES\n.\nExample\nIn this simplified example,\ndlt_source\nis conceptually like the Bronze layer. Next,\ndlt_view\nis a logical layer on top of Bronze to help facilitate further processing with the\nAPPLY CHANGES\nAPI. 
Last,\ndlt_target\nis the Silver layer of the medallion architecture.\ndlt.create_streaming_table(\"dlt_target\")\r\n\r\n@dlt.view\r\ndef dlt_source_view():\r\n  return (\r\n    spark.readStream\r\n    .format(\"delta\")\r\n    .table(\"dlt_source\")\r\n    .withColumns(\r\n        {\"\": \".\"}\r\n    \t)\r\n    )\r\n\r\ndlt.apply_changes(\r\n  target = \"dlt_target\",\r\n  source = \"dlt_source_view\",\r\n  keys = [\"\"],\r\n  sequence_by = \"col1\",\r\n  stored_as_scd_type = 1\r\n)\nFor more information on DLT views, refer to the\nDLT Python language reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/unable-to-view-delete-or-drop-an-external-location-in-the-ui-or-through-commands-even-with-admin-privileges.json b/scraped_kb_articles/unable-to-view-delete-or-drop-an-external-location-in-the-ui-or-through-commands-even-with-admin-privileges.json new file mode 100644 index 0000000000000000000000000000000000000000..26a4530cd1f4ac01e263fd99359cc4b269c75973 --- /dev/null +++ b/scraped_kb_articles/unable-to-view-delete-or-drop-an-external-location-in-the-ui-or-through-commands-even-with-admin-privileges.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unable-to-view-delete-or-drop-an-external-location-in-the-ui-or-through-commands-even-with-admin-privileges", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou are unable to view, delete, or drop an external location in the UI or through commands, despite having admin privileges. This issue yields an error message.\nUnauthorizedAccessException: PERMISSION_DENIED: User is not an owner of External Location 'sandbox_external_location'.\nThis issue typically arises in environments where users are migrating to Unity Catalog and need to manage external locations created by other engineers.\nCause\nIn Unity Catalog, only the owner of an external location can see and manage an external location. 
Metastore admins do not have this permission.\nSolution\nFirst, ensure you have metastore admin rights. If you don’t, add yourself or the required user as a metastore admin in the account console under the\nCatalog\nsection.\nNext, as a metastore admin, you can change the owner of the external location to yourself or another required user. Execute the following query to change the owner:\nALTER EXTERNAL LOCATION `sandbox_external_location` OWNER TO ``\nOnce the ownership has been changed, the new owner can proceed to delete or drop the external location using the appropriate commands.\nNote\nAdditionally, avoid storing production data in sandbox environments to prevent similar issues." +} \ No newline at end of file diff --git a/scraped_kb_articles/unclear-how-to-control-micro-batch-size-on-a-streaming-table-in-delta-live-tables-dlt.json b/scraped_kb_articles/unclear-how-to-control-micro-batch-size-on-a-streaming-table-in-delta-live-tables-dlt.json new file mode 100644 index 0000000000000000000000000000000000000000..ffc2342c074001f7559021f8efe549a5f73b391d --- /dev/null +++ b/scraped_kb_articles/unclear-how-to-control-micro-batch-size-on-a-streaming-table-in-delta-live-tables-dlt.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/unclear-how-to-control-micro-batch-size-on-a-streaming-table-in-delta-live-tables-dlt", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou want to control the micro-batch size on a streaming table, which is created in the same Delta Live Tables (DLT) pipeline using rate limiters, but it is not clear how to achieve this in DLT.\nCause\nThe\ndlt.readStream()\nfunction in Delta Live Tables (DLT) does not directly support the rate limit configuration\nmaxBytesPerTrigger\noption.\nThis option is typically used with\nspark.readStream()\nto limit the amount of data read in each micro-batch during streaming.\nSolution\nUse the rate limiters along with the keyword\nLIVE\n.\nExample\nHow 
to use the `\nmaxFilesPerTrigger\n` option in DLT.\n%python\r\nimport dlt\r\n@dlt.table\r\ndef dlt_test_target3():\r\n\treturn spark.readStream. \\\r\n\t\toption(\"maxFilesPerTrigger\",1). \\\r\n\t\ttable(\"source_db.streamingSource\")\r\n@dlt.table\r\ndef dlt_test_target4():\r\n\treturn spark.readStream. \\\r\n\t\toption(\"maxFilesPerTrigger\", 1). \\\r\n\t\ttable(\"LIVE.dlt_test_target3\")\nIn the above example,\ndlt_test_target3\nis defined as a streaming table within the DLT and is used as a source for another streaming table\ndlt_test_target4\n. Provide the\nLIVE\nkeyword on the source table in\nspark.readStream\nto apply rate limiters using\nmaxFilesPerTrigger\n.\nImportant\nNote: This code applies to DLT processes using streaming tables as sources AND pipelines configured in continuous mode (not in triggered mode)." +} \ No newline at end of file diff --git a/scraped_kb_articles/uncommitted-files-causing-data-duplication.json b/scraped_kb_articles/uncommitted-files-causing-data-duplication.json new file mode 100644 index 0000000000000000000000000000000000000000..24e5db13463837619a38f0692107be75e7de1ca4 --- /dev/null +++ b/scraped_kb_articles/uncommitted-files-causing-data-duplication.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/uncommitted-files-causing-data-duplication", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nYou had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but partially uncommitted files during the failed run are causing unwanted data duplication.\nCause\nHow Databricks commit protocol works:\nThe DBIO commit protocol (\nAWS\n|\nAzure\n|\nGCP\n) is transactional. Files are only committed after a transaction successfully completes. If the job fails in the middle of a transaction, only\n_started_\nand other partially written data files are stored.\nWhen the job is rerun, a new\n_started_\nfile is created. 
Once the transaction is successfully completed, a new\n_committed_\nfile is generated. This\n_committed_\nfile is a JSON file that contains all the parquet file names to be read by the upstream.\nIf you read the folder using Apache Spark there are no duplicates as it only reads the files which are inside\n_committed_\n.\nTo delete the uncommitted data files from the target path, DBIO runs\nVACUUM\nat the end of every job. By default, uncommitted files older than 48 hours (2 days) are removed.\nWhen the issue occurs:\nIf you read the folder within two days of the failed job, using another tool (which does not use DBIO or Spark) or read the folder with a wildcard (\nspark.read.load('/path/*')\n), all the files are read, including the uncommitted files. This results in data duplication.\nSolution\nThe ideal solution is to only use Spark or DBIO to access file storage.\nIf you must preserve access for other tools, you should update the value of\nspark.databricks.io.directoryCommit.vacuum.dataHorizonHours\nin your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nYou can also update this property in a notebook:\nspark.conf.set(\"spark.databricks.io.directoryCommit.vacuum.dataHorizonHours\",\"\")\nThis property determines which files are deleted when the automatic\nVACUUM\nruns at the end of every job. Any file older than the time specified is removed.\nThe default value is 48 hours (2 days). You can reduce this to as little as one hour, depending on your specific needs. 
If you set the value to one hour, the automatic\nVACUUM\nremoves any uncommitted files older than one hour at the end of every job.\nAlternatively, you can run VACUUM manually after rerunning a failed job with a\nRETAIN HOURS\nvalue low enough to remove the partially uncommitted files.\nPlease review the\nVACUUM\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\nWarning\nRunning\nVACUUM\nwith\nRETAIN HOURS\nset to\n0\ncan cause data consistency issues. If any other Spark jobs are writing files to this folder, running\nVACUUM\nwith\nRETAIN 0 HOURS\ndeletes those files. In general,\nVACUUM\nshould not have a\nRETAIN HOURS\nvalue smaller than\n1\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/understanding-speculative-execution.json b/scraped_kb_articles/understanding-speculative-execution.json new file mode 100644 index 0000000000000000000000000000000000000000..54408a906960926d0f71cba2289da6f8a7b84926 --- /dev/null +++ b/scraped_kb_articles/understanding-speculative-execution.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/understanding-speculative-execution", + "title": "Título do Artigo Desconhecido", + "content": "Speculative execution\nSpeculative execution can be used to automatically re-attempt a task that is not making progress compared to other tasks in the same stage.\nThis means if one or more tasks are running slower in a stage, they will be re-launched. The task that completes first is marked as successful. The other attempt gets killed.\nImplementation\nWhen a job hangs intermittently and one or more tasks are hanging, enabling speculative execution is often the first step to resolving the issue. 
As a result of speculative execution, the slow hanging task that is not progressing is re-attempted on another node.\nIf one or more tasks are running slower in a stage, those tasks are relaunched. If the relaunched task completes first, the original task is marked as failed. If the original task completes before the relaunched task, the original task attempt is marked as successful and the relaunched task is killed.\nIn addition to enabling speculative execution, there are a few additional settings that can be tweaked as needed. Speculative execution should only be enabled when necessary.\nBelow are the major configuration options for speculative execution.\nConfiguration\nDescription\nDatabricks Default\nOSS Default\nspark.speculation\nIf set to\ntrue\n, performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.\nfalse\nfalse\nspark.speculation.interval\nHow often Spark will check for tasks to speculate.\n100ms\n100ms\nspark.speculation.multiplier\nHow many times slower a task is than the median to be considered for speculation.\n3\n1.5\nspark.speculation.quantile\nFraction of tasks which must be complete before speculation is enabled for a particular stage.\n0.9\n0.75\nHow to interpret the Databricks default values\nIf speculative execution is enabled (\nspark.speculation\n), then every 100 ms (\nspark.speculation.interval\n), Apache Spark checks for slow running tasks. A task is marked as a slow running task if it is running more than three times longer (\nspark.speculation.multiplier\n) than the median execution time of completed tasks. Spark waits until 90% (\nspark.speculation.quantile\n) of the tasks have been completed before starting speculative execution.\nIdentifying speculative execution in action\nReview the task attempts in the Spark UI. 
If speculative execution is running, you see one task with the\nStatus\nas\nSuccess\nand the other task with a\nStatus\nof\nTaskKilled\n.\nSpeculative execution will not always start, even though there are slow tasks. This is because the criteria for speculative execution must be met before it starts running. This typically happens on stages with a small number of tasks, with only one or two tasks getting stuck. If the\nspark.speculation.quantile\nis not met, speculative execution does not start.\nWhen to enable speculative execution\nSpeculative execution can be used to unblock a Spark application when a few tasks are running for longer than expected and the cause is undetermined. Once a root cause is determined, you should resolve the underlying issue and disable speculative execution.\nSpeculative execution ensures that the speculated tasks are not scheduled on the same executor as the original task. This means that issues caused by a bad VM instance are easily mitigated by enabling speculative execution.\nWhen not to run speculative execution\nSpeculative execution should not be used on production jobs for a long time period. Extended use can result in failed tasks.\nIf the operations performed in the task are not idempotent, speculative execution should not be enabled.\nIf you have data skew, the speculated task can take as long as the original task, leaving the original task to succeed and the speculated task to get killed. Speculative execution does not guarantee the speculated task will finish first.\nEnabling speculative execution can impact performance, so it should only be used for troubleshooting. If you require speculative execution to complete your workloads, open a\nDatabricks support request\n. Databricks support can help determine the root cause of the task slowness." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unknown-apache-spark-internal-error-when-running-delta-table-queries.json b/scraped_kb_articles/unknown-apache-spark-internal-error-when-running-delta-table-queries.json new file mode 100644 index 0000000000000000000000000000000000000000..a61166178b3276b13cfbc6333b3b6eeaa2fe2817 --- /dev/null +++ b/scraped_kb_articles/unknown-apache-spark-internal-error-when-running-delta-table-queries.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/unknown-apache-spark-internal-error-when-running-delta-table-queries", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhile performing a query on a Delta Table, you encounter an error.\n[INTERNAL_ERROR] The Spark SQL phase planning failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.\nThis error may appear consistently for a specific table, even when limiting the number of rows in the\nSELECT\nstatement to a small number.\nCause\nThis is a generic Apache Spark error message that does not indicate a specific cause. In this case, we can confirm the cause by checking the full stack trace in the driver's log4j logs.\n(…)\r\nCaused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:\r\n    Partition column name list #0: partition_date\r\n    Partition column name list #1: partition_date, partition_date\r\nFor partitioned table directories, data files should only live in leaf directories.\r\nAnd directories at the same level should have the same partition column name.\r\nPlease check the following directories for unexpected files or inconsistent partition column names:\r\n(…)\nThe root cause in this case is a problematic hierarchy of the source table's folders/files in a specific partition. 
Spark expects a specific folder structure for partitioned tables, and any deviation from this structure, even for a single partition, can lead to internal errors. If this is the case, the error only appears when the problematic partition is being consulted.\nSolution\nReorganize the folder structure for the specific partition causing the problem. In general:\nEnsure that there are no parquet files in non-leaf directories.\nFor partitioned table directories, data files should only live in leaf directories.\nEnsure that all partitions of the same name are in the same level.\nDirectories at the same level should have the same partition column name.\nExample\nThe following directory list has unexpected files or inconsistent partition column names.\ndbfs:/your-table/partition_date=2022-01-01\r\ndbfs:/your-table/partition_date=2022-01-02\r\n(...)\r\ndbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02\r\nat scala.Predef$.assert(Predef.scala:223)\r\nat org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:576)\r\n(...)\nIn this case, the last partition listed has a problem.\ndbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02\nThis folder should be broken down into:\ndbfs:/your-table/partition_date=2022-09-02\r\ndbfs:/your-table/partition_date=2022-09-03\nTo prevent similar issues in the future, ensure that the folder structure for partitioned tables follows Spark's expectations." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unknown-host-exception-on-launch.json b/scraped_kb_articles/unknown-host-exception-on-launch.json new file mode 100644 index 0000000000000000000000000000000000000000..9d0cfef971355fcdd7306b4230347f93768daae6 --- /dev/null +++ b/scraped_kb_articles/unknown-host-exception-on-launch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unknown-host-exception-on-launch", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen you launch an Azure Databricks cluster, you get an\nUnknownHostException\nerror.\nYou may also get one of the following error messages:\nError: There was an error in the network configuration. databricks_error_message: Could not access worker artifacts.\nError: Temporary failure in name resolution.\nInternal error message: Failed to launch spark container on instance XXX. Exception: Could not add container for XXX with address X.X.X.X.mysql.database.azure.com: Temporary failure in name resolution.\nCause\nThese errors indicate an issue with DNS settings.\nPrimary DNS could be down or unresponsive.\nArtifacts are not being resolved, which results in the cluster launch failure.\nYou may have a host record listing the artifact public IP as static, but it has changed.\nSolution\nIdentify a working DNS server and update the DNS entry on the cluster.\nStart a\nstandalone Azure VM\nand verify that the artifacts blob storage account is reachable from the instance.\n`telnet dbartifactsprodeastus.blob.core.windows.net 443`.\nVerify that you can reach your primary DNS server from a notebook by running a\nping\ncommand.\nIf your DNS server is not responding, try to reach your secondary DNS server from a notebook by running a\nping\ncommand.\nLaunch a\nWeb Terminal\nfrom the cluster workspace.\nEdit the\n/etc/resolv.conf\nfile on the cluster.\nUpdate the\nnameserver\nvalue with your working DNS server.\nSave the changes to the 
file.\nRestart\nsystemd-resolved\n.\n$ sudo systemctl restart systemd-resolved.service\nInfo\nThis is a temporary change to the DNS and will be lost on cluster restart. After verifying that the custom DNS settings are correct, you can\nconfigure custom DNS settings using dnsmasq\nto make the change permanent.\nFurther troubleshooting\nIf you are still having DNS issues, you should try the following steps:\nVerify that port 43 (used for whois) and port 53 (used for DNS) are open in your firewall.\nAdd the Azure recursive resolver (168.63.129.16) to the default DNS forwarder. Review the\nVMs and role instances\ndocumentation for more information.\nVerify that\nnslookup\nresults are identical between your laptop and the default DNS. If there is a mismatch, your DNS server may have an incorrect host record.\nVerify that everything works with a default Azure DNS server. If it works with Azure DNS, but fails with your custom DNS, your DNS admin should review your DNS server settings." +} \ No newline at end of file diff --git a/scraped_kb_articles/unpin-cluster-configurations-using-the-api.json b/scraped_kb_articles/unpin-cluster-configurations-using-the-api.json new file mode 100644 index 0000000000000000000000000000000000000000..74559149e34163f69eb7e0042add8c83a96a6dba --- /dev/null +++ b/scraped_kb_articles/unpin-cluster-configurations-using-the-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unpin-cluster-configurations-using-the-api", + "title": "Título do Artigo Desconhecido", + "content": "Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated.\nIf you want to keep specific cluster configurations, you can\npin them\n. Up to 100 clusters can be pinned.\nIf you no longer need a pinned cluster, you can unpin it. 
If you have pinned 100 clusters, you must unpin a cluster before you can pin another one.\nNote\nYou must be a Databricks administrator to unpin a cluster.\nYou can easily unpin a cluster (\nAWS\n|\nAzure\n|\nGCP\n) via the workspace UI, but if you are managing your clusters via the API, you can also use the Unpin endpoint (\nAWS\n|\nAzure\n|\nGCP\n) in the Clusters API.\nInstructions\nUnpin all pinned clusters\nUse the following sample code to unpin all pinned clusters in your workspace.\nBefore running the sample code, you will need a personal access token (\nAWS\n|\nAzure\n|\nGCP\n) and your workspace domain. The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nRun the cell to unpin all pinned clusters in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders = {\r\n    'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\nfor pinned in cluster[\"clusters\"]:\r\n    if 'pinned_by_user_name' in pinned:\r\n        print(\"Unpinning\" + \" , \" + pinned[\"default_tags\"]['ClusterName'])\r\n        url = workspace_url + \"/api/2.0/clusters/unpin\"\r\n        requests.post(url, json={\"cluster_id\": pinned[\"cluster_id\"]}, headers=headers)\nUnpin a cluster by name\nUse the following sample code to unpin a specific cluster in your workspace.\nBefore running the sample code, you will need a personal access token and your workspace domain. 
The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nUpdate the\n\nvalue with the name of the cluster you want to unpin.\nRun the cell to unpin the selected cluster in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders = {\r\n    'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\nfor pinned in cluster[\"clusters\"]:\r\n    if 'pinned_by_user_name' in pinned:\r\n        if pinned[\"default_tags\"]['ClusterName'] == \"\":\r\n            print(\"Unpinning\" + \" , \" + pinned[\"default_tags\"]['ClusterName'])\r\n            url = workspace_url + \"/api/2.0/clusters/unpin\"\r\n            requests.post(url, json={\"cluster_id\": pinned[\"cluster_id\"]}, headers=headers)\nUnpin all clusters by a specific user\nUse the following sample code to unpin all clusters pinned by a specific user in your workspace.\nBefore running the sample code, you will need a personal access token and your workspace domain. 
The workspace domain is just the domain name.\nCopy and paste the sample code into a notebook cell.\nUpdate the\n\nand\n\nvalues.\nUpdate the\n\nvalue with the name of the user whose clusters you want to unpin.\nRun the cell to unpin the selected clusters in your workspace.\n%python\r\n\r\nimport requests\r\nworkspace_url = \"\"\r\naccess_token = \"\"\r\n\r\nurl = workspace_url + \"/api/2.0/clusters/list\"\r\n\r\nheaders = {\r\n    'Authorization': 'Bearer ' + access_token\r\n}\r\n\r\ncluster = requests.request(\"GET\", url, headers=headers).json()\r\n\r\nfor pinned in cluster[\"clusters\"]:\r\n    if 'pinned_by_user_name' in pinned:\r\n        if pinned[\"creator_user_name\"] == \"\":\r\n            url = workspace_url + \"/api/2.0/clusters/unpin\"\r\n            requests.post(url, json={\"cluster_id\": pinned[\"cluster_id\"]}, headers=headers)\r\n            print(\"Unpinning\" + \" , \" + pinned[\"default_tags\"]['ClusterName'])" +} \ No newline at end of file diff --git a/scraped_kb_articles/unresolvable-table-valued-function-error-when-trying-to-execute-create-or-replace-on-a-table-with-read-and-write-permissions.json b/scraped_kb_articles/unresolvable-table-valued-function-error-when-trying-to-execute-create-or-replace-on-a-table-with-read-and-write-permissions.json new file mode 100644 index 0000000000000000000000000000000000000000..3229d92e5fd91704b1a9efe517417d1ca3ac4455 --- /dev/null +++ b/scraped_kb_articles/unresolvable-table-valued-function-error-when-trying-to-execute-create-or-replace-on-a-table-with-read-and-write-permissions.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unresolvable-table-valued-function-error-when-trying-to-execute-create-or-replace-on-a-table-with-read-and-write-permissions", + "title": "Unknown Article Title", + "content": "Problem\nYou are attempting to execute a function on a table in Unity Catalog when you encounter an\nUnresolvable Table Valued Function\nerror. 
You have read and write permissions on the table, but you did not create it. The same function works on tables that you created.\nCause\nWhen you use a\nCREATE OR REPLACE\ncommand, it triggers a\nDROP TABLE\ncommand in the backend.\nDROP TABLE\nrequires you to have owner permissions for the table. If you do not have owner permissions, the\nDROP\noperation fails.\nNote\nIn Unity Catalog, having all privileges on a table is not the same as ownership of the table. Ownership includes the ability to grant privileges to others and to drop the object.\nSolution\nRequest the owner to perform the operation for you, or to grant you ownership.\nFor more information, refer to the\nDROP TABLE\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor more information on Unity Catalog object ownership, refer to the\nManage Unity Catalog object ownership\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/unresolved-column-error-when-using-apache-spark-connect-to-run-a-query-to-create-a-temporary-view.json b/scraped_kb_articles/unresolved-column-error-when-using-apache-spark-connect-to-run-a-query-to-create-a-temporary-view.json new file mode 100644 index 0000000000000000000000000000000000000000..31dd287144e035cabec277c05a64869e86367871 --- /dev/null +++ b/scraped_kb_articles/unresolved-column-error-when-using-apache-spark-connect-to-run-a-query-to-create-a-temporary-view.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unresolved-column-error-when-using-apache-spark-connect-to-run-a-query-to-create-a-temporary-view", + "title": "Unknown Article Title", + "content": "Problem\nWhen creating a new temporary view using Apache Spark Connect you encounter an issue.\n[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `col1` cannot be resolved. Did you mean one of the following? [`test_col`]. 
SQLSTATE: 42703\nThis error happens even when you know that the column exists and can be resolved.\nExample code\nIn the following code, the view is defined and then redefined, based on the underlying query.\ndf = spark.sql(\"select 'test' as col1\")\r\n\r\n#create the temporary view\r\ndf.createOrReplaceTempView('temp_view')\r\n\r\n#use the temporary view and save the result under the same name\r\ndf = spark.sql(\"select col1 as test_col from temp_view\")\r\ndf.createOrReplaceTempView('temp_view')\r\n\r\ndf.count()\nCause\nTemporary views in Spark Connect are lazily analyzed, which means that if there is a change to the temporary view, the change is not validated until the temporary view is called.\nUpon being called, the temporary view is evaluated and updated. In this case, as the temporary view was recreated, it does not retain a reference to previous versions of the temporary view, including columns previously defined. This results in the unresolved column error.\nSolution\nWhen working with temporary views, use unique names for each temporary view.\nIf possible, consider using DataFrames instead of temporary views. For more information, refer to the\nTutorial: Load and transform data using Apache Spark DataFrames\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/unsupported-path-error-when-creating-an-external-table.json b/scraped_kb_articles/unsupported-path-error-when-creating-an-external-table.json new file mode 100644 index 0000000000000000000000000000000000000000..7326d203ed32dc738d8e042250980be416b363b5 --- /dev/null +++ b/scraped_kb_articles/unsupported-path-error-when-creating-an-external-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/unsupported-path-error-when-creating-an-external-table", + "title": "Unknown Article Title", + "content": "Problem\nWhen attempting to create an external table with the following SQL query, you encounter an error indicating there is an issue with the specified path when trying to create the table.\n%sql\r\nCREATE OR REPLACE TABLE .\r\n(\r\n     ,\r\n     ,\r\n    …\r\n)\r\nUSING delta\r\nLOCATION 's3:////'\nNote\nThe error message's\nLOCATION\nmay vary based on the cloud provider you are using.\nFor Azure:\nabfss://@.dfs.core.windows.net//\nFor GCP:\ngs:////\nThe error message displayed is:\nUnsupported path operation PATH_CREATE_TABLE on volume.\nCause\nThe paths used for the external location and the external volume overlap.\nFor context, the conflict arises because external volumes and external tables serve different purposes. External volumes are used to manage and organize data storage, while external tables are used to query data stored in external locations. 
Creating volumes with the same path as the external location prevents you from adding additional Unity Catalog (UC) entities, such as tables or volumes, under that external location.\nSolution\nCreate the table in a different location that does not overlap with the external volume.\nDelete the existing external volume that has the same path as the external location.\nCreate a new external volume with a different path that does not overlap with the external location.\nCreate the external table using the following query:\n%sql\r\nCREATE OR REPLACE TABLE .\r\n(\r\n  ,\r\n  ,\r\n …\r\n)\r\nUSING delta\r\nLOCATION 's3:////'\nTo avoid similar issues in the future, Databricks recommends using unique paths for external volumes and external locations. For more information, please refer to the\nHow do paths work for data managed by Unity Catalog?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/unsupportedclassversionerror-when-running-a-databricks-job-on-graviton-machines.json b/scraped_kb_articles/unsupportedclassversionerror-when-running-a-databricks-job-on-graviton-machines.json new file mode 100644 index 0000000000000000000000000000000000000000..49cb53e868f0bed3cdc2dd7f6204346fbac9d89b --- /dev/null +++ b/scraped_kb_articles/unsupportedclassversionerror-when-running-a-databricks-job-on-graviton-machines.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/unsupportedclassversionerror-when-running-a-databricks-job-on-graviton-machines", + "title": "Unknown Article Title", + "content": "Problem\nYou encounter an\nUnsupportedClassVersionError\nwhen running a Databricks job on Graviton machines, despite the job running successfully with Java Virtual Machine (JVM) 17 on non-Graviton machines.\nDuring execution you see an error pointing to compute using JVM 8. 
You install the JVM package on your cluster, but still receive an error.\nUnsupportedClassVersionError: com//soon_oss/streaming/ingestion/config/common/EnvironmentConfig/EnvironmentConfig has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 52.0\nCause\nYou’re using an AMD JVM package on your Graviton machine:\nJNAME=zulu17-ca-amd64\n. Graviton machines use ARM architecture, not AMD.\nSolution\n1. Specify an ARM architecture instead. Refer to the\nDatabricks SDK for Java\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\n2. Update the JVM configuration. Change the JNAME parameter to use an ARM-compatible Java distribution.\n%sh\r\nJNAME=zulu17-ca-arm64\n3. After updating the JVM configuration, rerun the Databricks job on the Graviton machine." +} \ No newline at end of file diff --git a/scraped_kb_articles/unsupportedoperationexception-error-when-trying-to-run-queries-interacting-with-event-logs-and-multiple-tables.json b/scraped_kb_articles/unsupportedoperationexception-error-when-trying-to-run-queries-interacting-with-event-logs-and-multiple-tables.json new file mode 100644 index 0000000000000000000000000000000000000000..6b02f02ca4ccd79d3321be8c2a8a8aff4dc6e324 --- /dev/null +++ b/scraped_kb_articles/unsupportedoperationexception-error-when-trying-to-run-queries-interacting-with-event-logs-and-multiple-tables.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/unsupportedoperationexception-error-when-trying-to-run-queries-interacting-with-event-logs-and-multiple-tables", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to run queries involving interaction with event logs and multiple tables within the same notebook cell on an interactive cluster, you receive an error message.\nUnsupportedOperationException: Cannot read more than one event logs in the same query\nCause\nInteractive clusters run 
Spark-Connect, which allows you to execute DataFrame operations or queries on Apache Spark clusters from remote environments like IDEs, notebooks, or applications.\nSpark-Connect does not allow commands which query multiple tables within the same cell. As a result, when Spark-Connect encounters such queries, it processes the entire cell as a single query instead.\nSolution\nEither split queries into separate notebook cells, or use serverless compute.\nSplit queries into separate cells\nThe following example demonstrates how to split the queries so each notebook cell is treated as a separate query.\nCell 1\nsql = f\"CREATE OR REPLACE TABLE {}.event_log_{.} AS SELECT * FROM event_log(table({}.{.}))\"\nCell 2\nsql = f\"CREATE OR REPLACE TABLE {}.event_log_{.} AS SELECT * FROM event_log(table({}.{.}))\"\nUse serverless compute\nAlternatively, you can use serverless compute. In serverless, Spark-Connect executes queries independently. For more information, refer to the\nServerless compute release notes\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCell 1\ntables = [, , ]  \r\nfor table in tables:\r\n    sql = f\"create or replace table {}.event_log_{.} as select * from event_log(table({}.{.}))\"\r\n    spark.sql(sql)" +} \ No newline at end of file diff --git a/scraped_kb_articles/update-job-perms-multiple-users.json b/scraped_kb_articles/update-job-perms-multiple-users.json new file mode 100644 index 0000000000000000000000000000000000000000..72d9f1b526f0df0c5c4a74b35bfce1289f83ad3f --- /dev/null +++ b/scraped_kb_articles/update-job-perms-multiple-users.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/update-job-perms-multiple-users", + "title": "Unknown Article Title", + "content": "When you are running jobs, you might want to update user permissions for multiple users.\nYou can do this by using the Databricks job permissions API (\nAWS\n|\nAzure\n|\nGCP\n) and a bit of Python code.\nInstructions\nCopy the example code into a 
notebook.\nEnter the\n\n(or multiple job ids) into the array\narr[]\n.\nEnter your\npayload{}\n. In this example, we are using the\n\nand\n\nthat we want to grant.\nEnter the\n\ninto the\nurl\nfield.\nEnter the\n\nunder\nBearer\n.\nRun the notebook cell with the updated code.\nIf the update is successful, the code returns a response of\n200 (OK)\n.\nExample code\n%python\r\n\r\nimport requests\r\nimport json\r\n\r\narr=[,]\r\nfor j in arr :\r\n  def requestcall():\r\n      payload = {\"access_control_list\": [{\"user_name\": \"\",\"permission_level\": \"\"}]}\r\n      url='https:///api/2.0/permissions/jobs/'+str(j)\r\n      myResponse = requests.patch(url=url, headers={'Authorization': 'Bearer '}, verify=True, data=json.dumps(payload))\r\n      print(myResponse.status_code)\r\n      print(myResponse.content)\r\n        # If the API call is successful, the response code is 200 (OK).\r\n      if myResponse.ok:\r\n            # Extracting data in JSON format.\r\n       data = myResponse.json()\r\n       return data\r\n  requestcall()" +} \ No newline at end of file diff --git a/scraped_kb_articles/update-nested-column.json b/scraped_kb_articles/update-nested-column.json new file mode 100644 index 0000000000000000000000000000000000000000..9844283766084958eba1c478b7c5db0e36242a92 --- /dev/null +++ b/scraped_kb_articles/update-nested-column.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/update-nested-column", + "title": "Unknown Article Title", + "content": "Spark doesn’t support adding new columns or dropping existing columns in nested structures. In particular, the\nwithColumn\nand\ndrop\nmethods of the\nDataset\nclass don’t allow you to specify a column name different from any top level columns. 
For example, suppose you have a dataset with the following schema:\n%scala\r\n\r\nval schema = (new StructType)\r\n      .add(\"metadata\",(new StructType)\r\n             .add(\"eventid\", \"string\", true)\r\n             .add(\"hostname\", \"string\", true)\r\n             .add(\"timestamp\", \"string\", true)\r\n           , true)\r\n      .add(\"items\", (new StructType)\r\n             .add(\"books\", (new StructType).add(\"fees\", \"double\", true), true)\r\n             .add(\"paper\", (new StructType).add(\"pages\", \"int\", true), true)\r\n           ,true)\r\nschema.treeString\nThe schema looks like:\nroot\r\n |-- metadata: struct (nullable = true)\r\n |    |-- eventid: string (nullable = true)\r\n |    |-- hostname: string (nullable = true)\r\n |    |-- timestamp: string (nullable = true)\r\n |-- items: struct (nullable = true)\r\n |    |-- books: struct (nullable = true)\r\n |    |    |-- fees: double (nullable = true)\r\n |    |-- paper: struct (nullable = true)\r\n |    |    |-- pages: integer (nullable = true)\nSuppose you have the\nDataFrame\n:\n%scala\r\n\r\nval rdd: RDD[Row] = sc.parallelize(Seq(Row(\r\n  Row(\"eventid1\", \"hostname1\", \"timestamp1\"),\r\n  Row(Row(100.0), Row(10)))))\r\nval df = spark.createDataFrame(rdd, schema)\r\ndisplay(df)\nYou want to increase the\nfees\ncolumn, which is nested under\nbooks\n, by 1%. 
To update the\nfees\ncolumn, you can reconstruct the dataset from existing columns and the updated column as follows:\n%scala\r\n\r\nval updated = df.selectExpr(\"\"\"\r\n    named_struct(\r\n        'metadata', metadata,\r\n        'items', named_struct(\r\n          'books', named_struct('fees', items.books.fees * 1.01),\r\n          'paper', items.paper\r\n        )\r\n    ) as named_struct\r\n\"\"\").select($\"named_struct.metadata\", $\"named_struct.items\")\r\nupdated.show(false)\nThen you will get the result:\n+-----------------------------------+-----------------+\r\n| metadata                          | items           |\r\n+===================================+=================+\r\n| [eventid1, hostname1, timestamp1] | [[101.0], [10]] |\r\n+-----------------------------------+-----------------+" +} \ No newline at end of file diff --git a/scraped_kb_articles/update-notification-settings-for-jobs-with-the-jobs-api.json b/scraped_kb_articles/update-notification-settings-for-jobs-with-the-jobs-api.json new file mode 100644 index 0000000000000000000000000000000000000000..c9a4fe0a02a898320dff40e7c8e4d615aac48b78 --- /dev/null +++ b/scraped_kb_articles/update-notification-settings-for-jobs-with-the-jobs-api.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/update-notification-settings-for-jobs-with-the-jobs-api", + "title": "Unknown Article Title", + "content": "Email notifications can be useful when managing multiple jobs. If you have many jobs configured without notifications, manually adding notifications can be time consuming. Instead, you can use the\nJobs API\n(\nAWS\n|\nAzure\n|\nGCP\n) to add email notifications to the jobs in your workspace.\nInstructions\nIn order to call the Jobs API, you first need to set up a personal access token and a secret scope. 
This allows you to interact with the API via a script.\nAfter the secret scopes have been set up, you can run the example script in a notebook to update all the jobs in your workspace at once.\nCreate a Databricks personal access token\nFollow the\nPersonal access tokens for users\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a personal access token.\nCreate a secret scope\nFollow the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a secret scope.\nStore your personal access token and your Databricks instance in the secret scope\nFollow the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to store the personal access token you created and your Databricks instance as new secrets within your secret scope.\nYour Databricks instance is the hostname for your workspace, for example, xxxxx.cloud.databricks.com.\nUse a Python script to update job notifications for all the jobs in the workspace\nYou need to replace the following values in the script before running:\n\n- Email address to notify.\n\n- The name of your scope that holds the secrets.\n\n- The name of the secret that holds your Databricks instance.\n\n- The name of the secret that holds your personal access token.\nimport json\r\nimport requests\r\n\r\nAPI_URL = dbutils.secrets.get(scope = \"\", key = \"\") \r\nTOKEN = dbutils.secrets.get(scope = \"\", key = \"\")  \r\nurl = f\"{API_URL}/api/2.0/jobs/list\" #Get all the jobs created inside the workspace\r\npayload={}\r\nheaders = {\r\n  'Authorization': 'Bearer ' + TOKEN\r\n}\r\nresponse = requests.request(\"GET\", url, headers=headers, data=payload)\r\nresponse.json()\r\nfor job in response.json()['jobs']:\r\n    job_id = job[\"job_id\"]\r\n    payload = {\r\n\"job_id\": job_id,\r\n\"new_settings\": {\r\n\"email_notifications\": {\r\n\"on_start\": [\r\n\"\" #user's email ID\r\n],\r\n\"on_success\": [\r\n\"\" \r\n],\r\n\"on_failure\": [\r\n\"\"\r\n]\r\n}\r\n}\r\n    }\r\n  
  url2 = f\"{API_URL}/api/2.1/jobs/update\"\r\n    r = requests.post(url2, data = json.dumps(payload), headers=headers)\r\nprint(\"successfully added the email_notification to jobs\")\nInfo\nYou can modify the sample script to add additional filtering options if you don't want to add notifications to all jobs. For example, you can filter based on the job creator and only add notifications to the filtered jobs.\nThis version of the sample code adds an if condition, checking an email address against the value\ncreator_user_name\n. This filters the jobs based on the job creator.\nReplace\n\nwith the email address you want to filter on.\nimport json\r\nimport requests\r\n\r\n\r\nAPI_URL = dbutils.secrets.get(scope = \"\", key = \"\") \r\nTOKEN = dbutils.secrets.get(scope = \"\", key = \"\")  \r\nurl = f\"{API_URL}/api/2.0/jobs/list\"  #Get all the jobs created inside the workspace\r\npayload={}\r\nheaders = {\r\n  'Authorization': 'Bearer ' + TOKEN\r\n}\r\nresponse = requests.request(\"GET\", url, headers=headers, data=payload)\r\nresponse.json()\r\nfor job in response.json()['jobs']:\r\n    if job['creator_user_name']==\"\":  # filtering the jobs based on the job creator\r\n        job_id = job[\"job_id\"]\r\n        payload = {\r\n        \"job_id\": job_id,\r\n        \"new_settings\": {\r\n        \"email_notifications\": {\r\n        \"on_start\": [\r\n        \"\"  #user's email ID\r\n                      ],\r\n        \"on_success\": [\r\n        \"\"   \r\n                    ],\r\n        \"on_failure\": [\r\n        \"\"\r\n                       ]\r\n                    }\r\n                        }\r\n                    }\r\n        url2 = f\"{API_URL}/api/2.1/jobs/update\"\r\n        r = requests.post(url2, data = json.dumps(payload), headers=headers)\r\nprint(\"successfully added the email_notification to jobs\")\nVerify status in the Job UI\nOnce the code runs successfully you can verify the updated notification status by checking your jobs in 
the Job UI.\nIn the left nav menu, click on\nWorkflows\n.\nClick the name of a job you want to verify.\nOn the right side of the job details page, scroll down to the\nNotifications\nsection.\nThe email address you added to the sample script is now present and configured for notifications." +} \ No newline at end of file diff --git a/scraped_kb_articles/update-query-fails.json b/scraped_kb_articles/update-query-fails.json new file mode 100644 index 0000000000000000000000000000000000000000..bfe186d5748d4ed9a27dfca41862a0624cc265d3 --- /dev/null +++ b/scraped_kb_articles/update-query-fails.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/update-query-fails", + "title": "Unknown Article Title", + "content": "Problem\nWhen you execute a Delta Lake\nUPDATE\n,\nDELETE\n, or\nMERGE\nquery that uses Python UDFs in any of its transformations, it fails with the following exception:\nAWS\njava.lang.UnsupportedOperationException: Error in SQL statement:\r\nIllegalStateException: File (s3a://xxx/table1) to be rewritten not found among candidate files:\r\ns3a://xxx/table1/part-00001-39cae1bb-9406-49d2-99fb-8c865516fbaa-c000.snappy.parquet\nAzure\njava.lang.UnsupportedOperationException: Error in SQL statement:\r\nIllegalStateException: File (adl://xxx/table1) to be rewritten not found among candidate files:\r\nadl://xxx/table1/part-00001-39cae1bb-9406-49d2-99fb-8c865516fbaa-c000.snappy.parquet\nVersion\nThis problem occurs on Databricks Runtime 5.5 and below.\nCause\nDelta Lake internally depends on the\ninput_file_name()\nfunction for operations like\nUPDATE\n,\nDELETE\n, and\nMERGE\n.\ninput_file_name()\nreturns an empty value if you use it in a\nSELECT\nstatement that evaluates a Python UDF.\nUPDATE\ncalls\nSELECT\ninternally, which then fails to return file names and leads to the error. 
This error does not occur with Scala UDFs.\nSolution\nYou have two options:\nUse Databricks Runtime 6.0 or above, which includes the resolution to this issue:\n[SPARK-28153]\n.\nIf you can’t use Databricks Runtime 6.0 or above, use Scala UDFs instead of Python UDFs." +} \ No newline at end of file diff --git a/scraped_kb_articles/update-the-databricks-sql-warehouse-owner.json b/scraped_kb_articles/update-the-databricks-sql-warehouse-owner.json new file mode 100644 index 0000000000000000000000000000000000000000..d14f8484433bdb089bd5d9d7dc2f1c62e6f508b2 --- /dev/null +++ b/scraped_kb_articles/update-the-databricks-sql-warehouse-owner.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/update-the-databricks-sql-warehouse-owner", + "title": "Unknown Article Title", + "content": "Whoever creates a SQL warehouse is defined as the owner by default. There may be times when you want to transfer ownership of the SQL warehouse to another user. This can be done by\ntransferring ownership of Databricks SQL objects\n(\nAWS\n|\nAzure\n|\nGCP\n) via the UI or the Permissions REST API.\nInstructions\nInfo\nThe service principal cannot be changed to the owner with this method. If you want to modify the service principal, please reach out to your Databricks representative for further assistance.\nTransfer ownership via UI\nThe ownership of SQL objects can be changed by following the\nTransfer ownership of a SQL warehouse\ndocumentation (\nAWS\n|\nAzure\n|\nGCP\n).\nTransfer ownership via API\nThe public documentation describes how to use PUT requests to change the owner, but this method removes all existing permissions in the warehouse. 
To preserve the existing permissions, you must perform three specific actions.\nGet the existing permissions for the warehouse.\nParse the response.\nMake a\nPUT\nrequest to change the owner and append existing permissions.\nSample code (Python)\nInfo\nTo get your workspace URL, review\nWorkspace instance names, URLs, and IDs\n(\nAWS\n|\nAzure\n|\nGCP\n).\nReview the\nGenerate a personal access token\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details on how to create a personal access token for use with the REST APIs.\nTo get the\nSQL warehouse ID\nyou should open the SQL warehouse dashboard, click\nSQL Warehouses\nin the sidebar, and select the required warehouse. You will find the warehouse ID value in the browser URL. Look for\nsql/warehouses/?o=workspace_id#\n.\nTo run the sample code, you need to copy and paste each section into a notebook cell before running.\nYou will need to replace\n\n,\n\n,\n\n, and\n\nwith values that are specific to your workspace.\n\nis the Databricks username of the person that you want to transfer ownership to.\nThe sample code is broken up into four key sections. You should complete each section before running the next one.\nSetup variables and import required libraries\nCreate a function to check if a given user exists or not\nGet the existing permissions for the SQL warehouse\nVerify the new owner exists and update the permissions\nSetup variables and import required libraries\nimport json\r\nimport requests\r\n\r\nDATABRICKS_HOST = \"\"\r\nDATABRICKS_TOKEN = \"\"\r\nWAREHOUSE_ID = \"\"\r\nNEW_WAREHOUSE_OWNER = \"\"\r\n\r\nheaders = {\r\n    \"Content-Type\": \"application/json\",\r\n    \"Accept\": \"application/json\",\r\n    \"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\"\r\n}\nCreate a function to check if a given user exists or not\nThis function is used later in the code to verify that the\nNEW_WAREHOUSE_OWNER\nexists in the workspace. 
It takes a username as an input value and then verifies it against the existing list of users in the workspace.\n### First we check if the user exists in the environment or not.\r\ndef doesUserExist(user):\r\n    response = requests.request(\r\n        \"GET\",\r\n        f\"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Users?filter=userName+eq+{user}\",\r\n        headers=headers\r\n    )\r\n    if \"Resources\" in response.json():\r\n        return True\r\n    else:\r\n        return False\nGet the existing permissions for the SQL warehouse\nThis gets the existing permissions on the SQL warehouse.\n### Get existing permissions\r\nresponse = requests.request(\r\n    \"GET\",\r\n    f\"{DATABRICKS_HOST}/api/2.0/preview/permissions/sql/warehouses/{WAREHOUSE_ID}\",\r\n    headers=headers\r\n)\nVerify the new owner exists and update the permissions\nIf the new owner exists in the workspace, update the permissions payload, and make a PUT request to apply the new permissions.\nIf the new owner doesn't exist in the workspace, print an error message.\nif response.status_code != 200:\r\n    raise Exception(f\"Failed to get permissions. 
Response[{response.status_code}]: {response.text}\")\r\n\r\nif doesUserExist(NEW_WAREHOUSE_OWNER):\r\n    print(\"=== Permission BEFORE change ===\\n\", response.text)\r\n    existing_permissions = []\r\n    for permission in response.json()[\"access_control_list\"]:\r\n        if (permission[\"all_permissions\"][0][\"inherited\"] == False):\r\n            if (permission[\"all_permissions\"][0][\"permission_level\"] == \"IS_OWNER\"):\r\n                if doesUserExist(permission[\"user_name\"]):\r\n                    existing_permissions.append({\"user_name\": permission[\"user_name\"], \"permission_level\": \"CAN_MANAGE\"})\r\n            else:\r\n                if \"user_name\" in permission:\r\n                    key1 = \"user_name\"\r\n                else:\r\n                    key1 = \"group_name\"\r\n                existing_permissions.append(\r\n                    {\r\n                        key1: permission[key1],\r\n                        \"permission_level\": permission[\"all_permissions\"][0][\"permission_level\"],\r\n                    }\r\n                )\r\n    existing_permissions.append({\"user_name\": NEW_WAREHOUSE_OWNER, \"permission_level\": \"IS_OWNER\"})\r\n\r\n    ### Make PUT request to change owner and apply existing permissions\r\n    payload = json.dumps({\"access_control_list\": existing_permissions})\r\n    response = requests.request(\r\n        \"PUT\",\r\n        f\"{DATABRICKS_HOST}/api/2.0/preview/permissions/sql/warehouses/{WAREHOUSE_ID}\",\r\n        headers=headers,\r\n        data=payload\r\n    )\r\n    if response.status_code != 200:\r\n        raise Exception(f\"Failed to change permissions. 
Response[{response.status_code}]: {response.text}\")\r\n print(\"=== Permission AFTER change ===\\n\", response.text)\r\n \r\nelse:\r\n print(NEW_WAREHOUSE_OWNER + \" doesn't exist, so the permission cannot be changed\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/upgrading-to-143-lts-gives-the-error-comdatabrickssqlcloudfileserrorscloudfilesillegalargumentexception.json b/scraped_kb_articles/upgrading-to-143-lts-gives-the-error-comdatabrickssqlcloudfileserrorscloudfilesillegalargumentexception.json new file mode 100644 index 0000000000000000000000000000000000000000..d1cc597735416ea0ae3c7c1dfacf7e93dd9e03b6 --- /dev/null +++ b/scraped_kb_articles/upgrading-to-143-lts-gives-the-error-comdatabrickssqlcloudfileserrorscloudfilesillegalargumentexception.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/upgrading-to-143-lts-gives-the-error-comdatabrickssqlcloudfileserrorscloudfilesillegalargumentexception", + "title": "Unknown Article Title", + "content": "Problem\nIf your Auto Loader streaming job specifies the schema manually and also uses schema evolution, you may encounter an error when upgrading your Auto Loader job configuration from Databricks Runtime 13.3 LTS to 14.3 LTS.\ncom.databricks.sql.cloudfiles.errors.CloudFilesIllegalArgumentException: Schema evolution mode addNewColumns is not supported when the schema is specified. 
To use this mode, you can provide the schema through cloudFiles.schemaHints instead.\nCause\nAs of Databricks Runtime 14.3 LTS, you can no longer provide both\n.option(\"avroSchema\",your-schema)\nand\n.option(\"cloudFiles.schemaEvolutionMode\",\"addNewColumns\")\nparameters.\nSolution\nChoose to configure manually or via schema evolution.\nFor schema evolution, remove\navroSchema\nand use\n.option(\"cloudFiles.schemaHints\",your-schema)\nFor manual configuration, set\n.option(\"cloudFiles.schemaEvolutionMode\",\"none\")\n.\nFor more information, please review the\nConfigure schema inference and evolution in Auto Loader\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/upload-large-files-using-dbfs-api-20-and-powershell.json b/scraped_kb_articles/upload-large-files-using-dbfs-api-20-and-powershell.json new file mode 100644 index 0000000000000000000000000000000000000000..73d7237fadaa5cfac88b29b2b1d9339e1d100da4 --- /dev/null +++ b/scraped_kb_articles/upload-large-files-using-dbfs-api-20-and-powershell.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbfs/upload-large-files-using-dbfs-api-20-and-powershell", + "title": "Unknown Article Title", + "content": "Using the Databricks REST API to interact with your clusters programmatically can be a great way to streamline workflows with scripts.\nThe API can be called with various tools, including PowerShell. In this article, we are going to take a look at an example DBFS\nput\ncommand using curl and then show you how to execute that same command using PowerShell.\nThe DBFS API 2.0\nput\ncommand (\nAWS\n|\nAzure\n) limits the amount of data that can be passed using the contents parameter to 1 MB if the data is passed as a string. The same command can pass 2 GB if the data is passed as a file. 
It is mainly used for streaming uploads, but can also be used as a convenient single call for data upload.\nCurl example\nThis example uses curl to send a simple multipart form post request to the API to upload a file up to 2 GB in size.\nReplace all of the values in <> with appropriate values for your environment.\nDelete\nInfo\nTo get your workspace URL, review\nWorkspace instance names, URLs, and IDs\n(\nAWS\n|\nAzure\n).\nReview the\nGenerate a personal access token\n(\nAWS\n|\nAzure\n) documentation for details on how to create a personal access token for use with the REST APIs.\n# Parameters\r\ndatabricks_workspace_url=\"\"\r\npersonal_access_token=\"\"\r\nlocal_file_path=\"\"              # ex: /Users/foo/Desktop/file_to_upload.png\r\ndbfs_file_path=\"\"                # ex: /tmp/file_to_upload.png\r\noverwrite_file=\"\"\r\n\r\n\r\ncurl --location --request POST https://${databricks_workspace_url}/api/2.0/dbfs/put \\\r\n     --header \"Authorization: Bearer ${personal_access_token}\" \\\r\n     --form contents=@${local_file_path} \\\r\n     --form path=${dbfs_file_path} \\\r\n     --form overwrite=${overwrite_file}\nPowerShell example\nThis PowerShell example is longer than the curl example, but it sends the same multipart form post request to the API.\nThe below script can be used in any environment where\nPowerShell is supported\n.\nTo run the PowerShell script you must:\nReplace all of the values in\n<>\nwith appropriate values for your environment. Review the DBFS API 2.0\nput\ndocumentation for more information.\nSave the script as a\n.ps1\nfile. 
For example, you could call it\nupload_large_file_to_dbfs.ps1\n.\nExecute the script in PowerShell by running\n./upload_large_file_to_dbfs.ps1\nat the prompt.\n################################################## Parameters\r\n$DBX_HOST = \"\"\r\n$DBX_TOKEN = \"\"\r\n$FILE_TO_UPLOAD = \"\"      # ex: /Users/foo/Desktop/file_to_upload.png  \r\n$DBFS_PATH = \"\"            # ex: /tmp/file_to_upload.png\r\n$OVERWRITE_FILE = \"\"\r\n##################################################\r\n\r\n\r\n# Configure authentication\r\n$headers = New-Object \"System.Collections.Generic.Dictionary[[String],[String]]\"\r\n$headers.Add(\"Authorization\", \"Bearer \"  + $DBX_TOKEN)\r\n\r\n$multipartContent = [System.Net.Http.MultipartFormDataContent]::new()\r\n\r\n# Local file path\r\n$FileStream = [System.IO.FileStream]::new($FILE_TO_UPLOAD, [System.IO.FileMode]::Open)\r\n$fileHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new(\"form-data\")\r\n$fileHeader.Name = $(Split-Path $FILE_TO_UPLOAD -leaf)\r\n$fileHeader.FileName = $(Split-Path $FILE_TO_UPLOAD -leaf)\r\n$fileContent = [System.Net.Http.StreamContent]::new($FileStream)\r\n$fileContent.Headers.ContentDisposition = $fileHeader\r\n$fileContent.Headers.ContentType = [System.Net.Http.Headers.MediaTypeHeaderValue]::Parse(\"text/plain\")\r\n$multipartContent.Add($fileContent)\r\n\r\n\r\n# DBFS path\r\n$stringHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new(\"form-data\")\r\n$stringHeader.Name = \"path\"\r\n$stringContent = [System.Net.Http.StringContent]::new($DBFS_PATH)\r\n$stringContent.Headers.ContentDisposition = $stringHeader\r\n$multipartContent.Add($stringContent)\r\n\r\n\r\n# File overwrite config\r\n$stringHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new(\"form-data\")\r\n$stringHeader.Name = \"overwrite\"\r\n$stringContent = [System.Net.Http.StringContent]::new($OVERWRITE_FILE)\r\n$stringContent.Headers.ContentDisposition = 
$stringHeader\r\n$multipartContent.Add($stringContent)\r\n\r\n\r\n# Call Databricks DBFS REST API\r\n$body = $multipartContent\r\n$uri = 'https://' + $DBX_HOST + '/api/2.0/dbfs/put'\r\n$response = Invoke-RestMethod $uri -Method 'POST' -Headers $headers -Body $body\r\n$response | ConvertTo-Json\nDelete\nInfo\nYou can use PowerShell scripts in Linux and OS X as well as Windows. The command to run a PowerShell script is slightly different in those environments. Refer to the PowerShell documentation if you are trying to run the script on a platform other than Windows." +} \ No newline at end of file diff --git a/scraped_kb_articles/uploaded-artifacts-to-volume-using-databricks-asset-bundles-dab-not-appearing.json b/scraped_kb_articles/uploaded-artifacts-to-volume-using-databricks-asset-bundles-dab-not-appearing.json new file mode 100644 index 0000000000000000000000000000000000000000..52e704bd8de080b5b8effa38b31553cc2ab20def --- /dev/null +++ b/scraped_kb_articles/uploaded-artifacts-to-volume-using-databricks-asset-bundles-dab-not-appearing.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/uploaded-artifacts-to-volume-using-databricks-asset-bundles-dab-not-appearing", + "title": "Unknown Article Title", + "content": "Problem\nWhen using Databricks Asset Bundles (DAB) to deploy artifacts, you encounter a situation where artifacts are not uploaded to a given volume as expected.\nThis issue is not accompanied by an error message. When you run the\ndatabricks bundle deploy\ncommand, the deployment completes but the expected path in the volume is not created, and no artifacts appear there.\nCause\nBy default, the\nworkspace.artifact_path\nproperty in the DAB YAML file points to the workspace file system. 
This property indicates where artifacts should be uploaded.\nVolumes are a separate storage location, but you can edit the\nworkspace.artifact_path\nproperty to point to a volume.\nSolution\nConfigure the\nworkspace.artifact_path\nproperty to your desired volume path in your\ndatabricks.yml\nfile. Replace\n\nwith your respective platform:\nAWS:\ncloud.databricks.com\nAzure:\nazuredatabricks.net\nGCP:\ngcp.databricks.com\nbundle:\r\n      name: \r\n      description: Upload a test .whl artifact to Volumes via DAB\r\n      include:\r\n        - ./artifacts/*\r\n    targets:\r\n      dev:\r\n        workspace:\r\n          host: https://./\r\n          artifact_path: /Volumes/\r\n        artifacts:\r\n          my_whl:\r\n            path: ./\nRun the\ndatabricks bundle deploy\ncommand to deploy your artifacts to the configured volume.\nFor more information, review the “artifact_path” section of the\nDatabricks Asset Bundle configuration\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/use-a-cluster-policy-to-disable-photon.json b/scraped_kb_articles/use-a-cluster-policy-to-disable-photon.json new file mode 100644 index 0000000000000000000000000000000000000000..f6156845693df8ee852d65a70a60f8749d1d0b1b --- /dev/null +++ b/scraped_kb_articles/use-a-cluster-policy-to-disable-photon.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/use-a-cluster-policy-to-disable-photon", + "title": "Unknown Article Title", + "content": "Problem:\nYou want to use a cluster policy to prevent users from creating clusters with Photon enabled.\nSolution:\nUse the\nruntime_engine\nparameter in a cluster policy to prevent users from creating clusters with Photon enabled.\nExample code:\n{\r\n \"runtime_engine\": {\r\n \"type\": \"blocklist\",\r\n \"values\": [\r\n \"PHOTON\"\r\n ]\r\n }\r\n}" +} \ No newline at end of file diff --git 
a/scraped_kb_articles/use-an-azure-ad-service-principal-as-compute-acl.json b/scraped_kb_articles/use-an-azure-ad-service-principal-as-compute-acl.json new file mode 100644 index 0000000000000000000000000000000000000000..9b615c32418b8f87932b01ebf231effb31698ac3 --- /dev/null +++ b/scraped_kb_articles/use-an-azure-ad-service-principal-as-compute-acl.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/use-an-azure-ad-service-principal-as-compute-acl", + "title": "Unknown Article Title", + "content": "When granting permissions to a compute cluster (compute access control), it is possible to grant permission to the following entities:\nUsers\nGroups\nService principals (Azure only)\nDelete\nWarning\nBefore you can use compute access control, an administrator must enable it for the workspace. Review\nEnable cluster access control for your workspace\nfor more information. You should also ensure you meet the requirements to use\nSCIM API 2.0 (ServicePrincipals)\n.\nInstructions\nCreate a service principal and add it to your workspace\nOption 1:\nFollow the\nAdd service principal\nAPI documentation to create a service principal and add it to your workspace.\nOption 2:\nRun this example code in a notebook.\n%sh\r\n\r\ncurl --location --request POST 'https://\n/api/2.0/preview/scim/v2/ServicePrincipals'\n \\\r\n--header 'Authorization: Bearer ' \\\r\n--header 'Content-Type: application/json' \\\r\n--data-raw '{\r\n \"schemas\":[\r\n  \"urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal\"\r\n ],\r\n \"applicationId\":\"\",\r\n \"displayName\":\"\"\r\n}'\nReplace the following values before running the example code:\n\n- Your\nAzure Databricks personal access token\n. 
If you do not have an access token, you will have to create one.\n\n- The Azure application ID of the service principal, for example\n12345a67-xxx-0d1e-23fa-4567b89cde01\n.\n\n- The workspace instance name, for example\nadb-1234567890123456.7.azuredatabricks.net\n.\n\n- The display name of the service principal, for example\nservice-principal-dbuser@azure.com\n.\nAdd the service principal to your compute ACL\nAfter the service principal has been added to your workspace, you have to add it to your compute.\nClick\nCompute\nin the left menu bar.\nClick the name of your compute cluster.\nClick\nMore\n.\nClick\nPermissions\n.\nClick the\nSelect User, Group or Service Principal\ndrop-down.\nSelect the service principal you created in the previous step.\nSelect the permission to assign to the service principal (ex.\nCan Read\n,\nCan Manage\n).\nClick\n+Add\n.\nClick\nSave\n.\nClick\nMore\n.\nClick\nRestart.\nClick\nConfirm\nto restart the compute cluster." +} \ No newline at end of file diff --git a/scraped_kb_articles/use-custom-classes-and-objects-in-a-schema.json b/scraped_kb_articles/use-custom-classes-and-objects-in-a-schema.json new file mode 100644 index 0000000000000000000000000000000000000000..2d3e83e12b6ed96d912aeedcdd9fb4a9ffd079cb --- /dev/null +++ b/scraped_kb_articles/use-custom-classes-and-objects-in-a-schema.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/use-custom-classes-and-objects-in-a-schema", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to create a dataset using a schema that contains Scala enumeration fields (classes and objects). 
When you run your code in a notebook cell, you get a\nClassNotFoundException\nerror.\nSample code\n%scala\r\n\r\nobject TestEnum extends Enumeration {\r\n type TestEnum = Value\r\n val E1, E2, E3 = Value\r\n}\r\n\r\nimport spark.implicits._\r\nimport TestEnum._\r\n\r\ncase class TestClass(i: Int, e: TestEnum) {}\r\nval ds = Seq(TestClass(1, TestEnum.E1)).toDS\nError message\nClassNotFoundException: lineb3e041f628634740961b78d5621550d929.$read$$iw$$iw$$iw$$iw$$iw$$iw$TestEnum\nCause\nThe\nClassNotFoundException\nerror occurred because the sample code does not define the class and object in a package cell.\nSolution\nIf you want to use custom Scala classes and objects defined within notebooks (in Apache Spark and across notebook sessions) you must define the class and object inside a package and import the package into your notebook.\nDelete\nInfo\nOnly class and object definitions can go in a package cell. Package cells cannot contain any function definitions, values, or variables.\nDefine the class and object\nThis sample code starts off by creating the package\ncom.databricks.example\n. It then defines the object\nTestEnum\nand assigns values, before defining the class\nTestClass\n.\n%scala\r\n\r\npackage com.databricks.example // Create a package.\r\nobject TestEnum extends Enumeration { // Define an object called TestEnum.\r\n  type TestEnum = Value \r\n  val E1, E2, E3 = Value // Enum values \r\n}\r\ncase class TestClass(i: Int, other:TestEnum.Value) // Define a class called TestClass.\nDelete\nInfo\nClasses defined within packages cannot be redefined without a cluster restart.\nImport the package\nAfter the class and object have been defined, you can import the package you created into a notebook and use both the class and the object.\nThis sample code starts by importing the\ncom.databricks.example\npackage that we just defined.\nIt then evaluates a DataFrame using the\nTestClass\nclass and\nTestEnum\nobject. 
Both are defined in the\ncom.databricks.example\npackage.\n%scala\r\n\r\nimport com.databricks.example \r\nval df = sc.parallelize(Array(example.TestClass(1,(example.TestEnum.E1)))).toDS().show()\nThe DataFrame successfully displays after the sample code is run.\nPlease review the package cells (\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information." +} \ No newline at end of file diff --git a/scraped_kb_articles/use-internal-ntp.json b/scraped_kb_articles/use-internal-ntp.json new file mode 100644 index 0000000000000000000000000000000000000000..10358b2f2c9a551be66fa3e9524f9b1120614d85 --- /dev/null +++ b/scraped_kb_articles/use-internal-ntp.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/use-internal-ntp", + "title": "Unknown Article Title", + "content": "By default, Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster.\nUpdate the NTP configuration on a cluster\nCreate an\nntp.conf\nfile with the following information:\n# NTP configuration\r\nserver iburst\nwhere\n\nis an NTP server hostname or an NTP server IP address.\nIf you have multiple NTP servers to list, add them all to the file. 
Each server should be listed on its own line.\nUpload the\nntp.conf\nfile to\n/dbfs/databricks/init_scripts/\non your cluster.\nCreate the script\nntp.sh\non your cluster:\n%python\r\n\r\ndbutils.fs.put(\"/databricks/init_scripts/ntp.sh\",\"\"\"\r\n#!/bin/bash\r\necho \" \" >> /etc/hosts\r\ncp /dbfs/databricks/init_scripts/ntp.conf /etc/\r\nsudo service ntp restart\"\"\",True)\nConfirm that the script exists:\n%python\r\n\r\ndisplay(dbutils.fs.ls(\"dbfs:/databricks/init_scripts/ntp.sh\"))\nClick\nClusters\n, click your cluster name, click\nEdit\n, click\nAdvanced Options\n, click\nInit Scripts\n.\nSelect\nDBFS\nunder\nDestination\n.\nEnter the full path to\nntp.sh\nand click\nAdd\n.\nClick\nConfirm and Restart\n. A confirmation dialog box appears. Click\nConfirm\nand wait for the cluster to restart.\nVerify the cluster is using the updated NTP configuration\nRun the following code in a notebook:\n%sh ntpq -p\nThe output displays the NTP servers that are in use." +} \ No newline at end of file diff --git a/scraped_kb_articles/use-snappy-and-zstd-compression-types-in-a-delta-table-without-rewriting-entire-table.json b/scraped_kb_articles/use-snappy-and-zstd-compression-types-in-a-delta-table-without-rewriting-entire-table.json new file mode 100644 index 0000000000000000000000000000000000000000..1625ab9af77835c8801ec3ddf7804b8eab006bbb --- /dev/null +++ b/scraped_kb_articles/use-snappy-and-zstd-compression-types-in-a-delta-table-without-rewriting-entire-table.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/use-snappy-and-zstd-compression-types-in-a-delta-table-without-rewriting-entire-table", + "title": "Unknown Article Title", + "content": "Problem\nYou need to change the compression codec for writing data to improve storage efficiency. 
Rewriting the entire table is impractical, but you are concerned that switching may corrupt existing tables if you mix\nsnappy\n-compressed files and\nzstd\n-compressed files in a Delta table.\nCause\nDatabricks allows different codecs, but you must ensure data integrity and readability aren't affected by the transition.\nSolution\nFollow the list of steps to test your compression type, generate and insert sample records using\nzstd\n, then write the\nzstd\nfiles to your Delta table. If you already have a Delta table, you can skip to step 4.\nRead 10 numbers into a DataFrame.\ndf = spark.range(0, 10).toDF(\"numbers\") \r\ndisplay(df)\nPersist the DataFrame records as a Delta table.\ndf.write.format('delta').mode(\"append\").saveAsTable(\"..\")\nDescribe detail using\n%sql describe detail catalog_name.schema_name.table_name\nto get the location of your Delta table, and run the\n%fs ls\ncommand on the table location to see the list of files. All files in the list are written by default in\n.snappy.parquet\ncompression codec file format.\n%fs ls //\nTo switch the compression type from the default\nsnappy\nto\nzstd\n, generate sample records and insert them into the table using either incremental updates or append.\ndf_zstd = spark.range(11, 20).toDF(\"numbers\")\r\ndisplay(df_zstd)\nWrite this new data to the existing Delta table by changing the compression type to\nzstd\nusing incremental updates or append.\ndf_zstd.write.format('delta').option(\"compression\",\"zstd\").mode(\"append\").saveAsTable(\"..\")\nRun the\n%fs ls\ncommand on the table location again to see the list of\n.snappy.parquet\nand newly changed\n.zstd.parquet\nfiles, which co-exist in the table location.\n%fs ls //\nReview the table to verify the records by running a select statement. 
Ensure you can read all the data without corrupting the table.\n%sql select * ..;" +} \ No newline at end of file diff --git a/scraped_kb_articles/user-does-not-have-permission-select-on-any-file.json b/scraped_kb_articles/user-does-not-have-permission-select-on-any-file.json new file mode 100644 index 0000000000000000000000000000000000000000..df343822426bc3435f44b9d87dd867c4959469ac --- /dev/null +++ b/scraped_kb_articles/user-does-not-have-permission-select-on-any-file.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/user-does-not-have-permission-select-on-any-file", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to create an external Hive table, but keep getting a\nUser does not have permission SELECT on any file\nerror message.\njava.lang.SecurityException: User does not have permission SELECT on any file.\nTable access control (\nAWS\n|\nAzure\n|\nGCP\n) is enabled on your cluster and you are not an admin.\nCause\nThe Databricks SQL query analyzer enforces access control policies at runtime on Databricks clusters with table access control enabled as well as all SQL warehouses.\nWhen table access control is enabled on a cluster, the user must have specific permission to access a table in order to be able to read the table.\nThe only users who can bypass table access control are Databricks admins.\nSolution\nAn admin must grant\nSELECT\npermission on files so the selected user can create a table.\nDelete\nWarning\nUsers granted access to ANY FILE can bypass the restrictions put on the catalog, schemas, tables, and views by reading from the filesystem directly.\nReview the\nData object privileges\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for more information.\nDelete\nInfo\nThe following steps must be run as an Admin.\nAdmins can also grant permissions to groups instead of users.\nStart the cluster.\nOpen a notebook.\nRun the following to grant\nSELECT\npermission on any file to the specified 
user.\n%sql\r\nGRANT SELECT ON ANY FILE TO ``" +} \ No newline at end of file diff --git a/scraped_kb_articles/user-not-found-error-while-trying-to-install-a-library-on-a-shared-cluster.json b/scraped_kb_articles/user-not-found-error-while-trying-to-install-a-library-on-a-shared-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..92ed9d9917e9a3e086a81c658f14651473f6594e --- /dev/null +++ b/scraped_kb_articles/user-not-found-error-while-trying-to-install-a-library-on-a-shared-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/user-not-found-error-while-trying-to-install-a-library-on-a-shared-cluster", + "title": "Unknown Article Title", + "content": "Problem\nWhen trying to install a library using the UI on a shared cluster, you encounter a\nUSER_ID_NOT_FOUND_FAILURE\nerror.\n\"Library installation failed after PENDING for 0 minutes since the cluster entered the RUNNING state. Error Code: USER_ID_NOT_FOUND_FAILURE. Library installation failed on cluster  due to an invalid user. Please reinstall the library.\"\nCause\nFor shared clusters, the library installer's identity determines whether the user has permission to install libraries on clusters and access the library files.\nChanging a shared cluster’s ownership away from a deleted user to an active user doesn’t resolve access issues because the library installer’s identity is still set to the deleted user. The library installer’s identity is no longer valid in the workspace's cluster libraries database.\nAccess Mode\nInstallation Identity\nInstallation identity before ownership change\nInstallation identity after ownership change to User B\nShared\nInstaller’s identity\nUser A (Installer’s identity)\nUser A (Installer’s identity)\nSolution\nChange the ownership of the cluster to a different, active user in the workspace. 
Then, reconfigure desired libraries to match the new owner.\nDelete all libraries from the cluster.\nTerminate the cluster.\nStart the cluster.\nReinstall each library one by one." +} \ No newline at end of file diff --git a/scraped_kb_articles/users-being-provisioned-in-databricks-outside-of-scim.json b/scraped_kb_articles/users-being-provisioned-in-databricks-outside-of-scim.json new file mode 100644 index 0000000000000000000000000000000000000000..070b74a576d098cb56b37c0a5547689a124d113d --- /dev/null +++ b/scraped_kb_articles/users-being-provisioned-in-databricks-outside-of-scim.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/users-being-provisioned-in-databricks-outside-of-scim", + "title": "Unknown Article Title", + "content": "Problem\nYou notice users are being provisioned outside of the SCIM process in your organization’s Databricks environment. For example, a new employee is able to log into your organization’s Databricks account console using SSO (SAML) and set up an account before being provisioned through SCIM.\nCause\nThe\nAuto user creation\nsetting in the Databricks workspace admin settings is enabled.\nSolution\nDisable the\nAuto user creation\nsetting in the Databricks workspace admin settings.\n1. Log in to the Databricks workspace as an admin.\n2. Navigate to the admin console.\n3. Click the\nSettings\ntab.\n4. Scroll down to the\nAuthentication\nsection.\n5. Uncheck the\nAuto user creation\ncheckbox.\n6. Click\nSave\nto save the changes.\nIt is also recommended to check the audit logs to ensure that no other users have been provisioned outside of SCIM. If any users have been provisioned outside of SCIM, they should be manually de-provisioned and then re-provisioned through SCIM." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/users-unable-to-view-job-results-when-using-remote-git-source.json b/scraped_kb_articles/users-unable-to-view-job-results-when-using-remote-git-source.json new file mode 100644 index 0000000000000000000000000000000000000000..b344df0d91086a5a0773256bd13f1f0e5a385293 --- /dev/null +++ b/scraped_kb_articles/users-unable-to-view-job-results-when-using-remote-git-source.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/jobs/users-unable-to-view-job-results-when-using-remote-git-source", + "title": "Unknown Article Title", + "content": "Problem\nYou are running a job using notebooks that are stored in a remote Git repository (\nAWS\n|\nAzure\n|\nGCP\n). Databricks users with\nCan View\npermissions (who are not workspace admins or owners of the job) cannot access or view the results of ephemeral jobs submitted via\ndbutils.notebook.run()\nfrom a parent notebook.\nCause\nWhen job visibility control (\nAWS\n|\nAzure\n|\nGCP\n) is enabled in the workspace, users can only see jobs permitted by their access control level. This works as expected with notebooks stored in the workspace. However, Databricks does not manage access control for your remote Git repo, so it does not know if there are any permission restrictions on notebooks stored in Git. The only Databricks user that definitely has permission to access notebooks in the remote Git repo is the job owner. As a result, other non-admin users are blocked from viewing, even if they have\nCan View\npermissions in Databricks.\nSolution\nYou can work around this issue by configuring the job notebook source as your Databricks workspace and making the first task of your job fetch the latest changes from the remote Git repo.\nThis allows for scenarios where the job has to ensure it is using the latest version of a notebook stored in a shared, remote Git repo.\nThis image illustrates the two-part process. 
First, you fetch the latest changes to the notebook from the remote Git repo. Then, after the latest version of the notebook has been synced, you start running the notebook as part of the job.\nDelete\nInfo\nWhen you create your job in the UI, make sure\nType\nis set to\nNotebook\nand\nSource\nis set to\nWorkspace\n. Select\nRepos\nwhen choosing the notebook path.\nConfigure secret access\nCreate a Databricks personal access token\nFollow the\nPersonal access tokens for users\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a personal access token.\nCreate a secret scope\nFollow the\nCreate a Databricks-backed secret scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to create a secret scope.\nStore your personal access token and your Databricks instance in the secret scope\nFollow the\nCreate a secret in a Databricks-backed scope\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to store the personal access token you created and your Databricks instance as new secrets within your secret scope.\nYour Databricks instance is the hostname for your workspace, for example, xxxxx.cloud.databricks.com.\nUse a script to sync the latest changes\nThis sample Python code pulls the latest revision from the remote Git repo and syncs it with the local notebook. 
This ensures the local notebook is up-to-date before processing the job.\nYou need to replace the following values in the script before running:\n\n- The name of the remote Git repo.\n\n- The name of your scope that holds the secrets.\n\n- The name of the secret that holds your Databricks instance.\n\n- The name of the secret that holds your personal access token.\n%python\r\n\r\nimport requests\r\nimport json\r\ndatabricks_instance = dbutils.secrets.get(scope = \"\", key = \"\")\r\ntoken = dbutils.secrets.get(scope = \"\", key = \"\")\r\n\r\nurl = f\"{databricks_instance}/api/2.0/repos/\" # Use repos API to get repo id https://docs.databricks.com/dev-tools/api/latest/repos.html#operation/get-repos\r\npayload = json.dumps({\r\n \"branch\": \"main\" # use branch/tag. Refer https://docs.databricks.com/dev-tools/api/latest/repos.html#operation/update-repo\r\n})\r\nheaders = {\"Authorization\": f\"Bearer {token}\", \"Content-Type\": \"application/json\"}\r\n\r\nresponse = requests.request(\"PATCH\", url, headers=headers, data=payload, timeout=60)\r\nprint(response.text)\r\nif response.status_code != 200:\r\n    raise Exception(f\"Failure during fetch operation. 
Response code: {response}\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/users-with-reader-role-in-azure-appearing-in-databricks-workspace-admin-group.json b/scraped_kb_articles/users-with-reader-role-in-azure-appearing-in-databricks-workspace-admin-group.json new file mode 100644 index 0000000000000000000000000000000000000000..3fb7ee0a384fb49cbf4c7ba97354d95d1e5cb7b5 --- /dev/null +++ b/scraped_kb_articles/users-with-reader-role-in-azure-appearing-in-databricks-workspace-admin-group.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/security/users-with-reader-role-in-azure-appearing-in-databricks-workspace-admin-group", + "title": "Unknown Article Title", + "content": "Problem\nAs a Databricks workspace admin, you may notice that some users in your admin group only have the Reader role assigned in your Azure portal. You don’t expect the Reader role to have administrative privileges in Databricks.\nCause\nThis behavior occurs when the user with a Reader role has additional Azure role assignments, either directly or through group membership, that include the required permissions to manage the Databricks workspace.\nAccording to Microsoft’s documentation, a user with one of the following Azure portal built-in roles is automatically made a Databricks workspace admin when they launch the workspace from the Azure portal.\nContributor\nOwner\nAny custom role that includes the required Azure admin permissions\nFor details on the required Azure admin permissions, review the\nAzure Databricks administration introduction\ndocumentation.\nFor more information on users, review the\nManage users\ndocumentation.\nSolution\nReview and adjust role assignments in the Azure portal.\nIn the Azure portal, navigate to the resource group or subscription level where the Databricks workspace is deployed.\nClick\nAccess control (IAM)\n.\nLocate the user or group which should not appear in the Databricks workspace admin group.\nReview all role 
assignments for the user or group, including inherited and group-based roles.\nReview the current configuration and refer to the official documentation to ensure it is set up correctly based on your requirements.\nFor additional details, see the “What are workspace admins?” section of the\nAzure Databricks administration introduction\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/using-collect_list-after-transformations-such-as-join-returns-inconsistent-counts-even-though-the-underlying-data-doesnt-change.json b/scraped_kb_articles/using-collect_list-after-transformations-such-as-join-returns-inconsistent-counts-even-though-the-underlying-data-doesnt-change.json new file mode 100644 index 0000000000000000000000000000000000000000..43655fc626196d4837963cd0bf1d9c73a35f57de --- /dev/null +++ b/scraped_kb_articles/using-collect_list-after-transformations-such-as-join-returns-inconsistent-counts-even-though-the-underlying-data-doesnt-change.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/using-collect_list-after-transformations-such-as-join-returns-inconsistent-counts-even-though-the-underlying-data-doesnt-change", + "title": "Unknown Article Title", + "content": "Problem\nWhen using\ncollect_list\nafter a transformation such as a JOIN or GROUP BY, you observe fewer resultant records than you expected, and the count varies with each execution even though the underlying data remains unchanged.\nExample code\n%python\r\nfrom pyspark.sql import functions as F\r\ndf2=df1.groupBy('column_1').agg(F.collect_list('column_2').alias('collect_list_output'))\nCause\nThe Apache Spark\ncollect_list\nfunction is non-deterministic. Its results depend on the order of rows, which may also be non-deterministic after a shuffle. This can lead to unexpected results in subsequent joins or transformations.\nSolution\nModify your code to sort the list using the\narray_sort\nfunction after\ncollect_list\n. 
This ensures a consistent order in the derived column, which can help to prevent unexpected results in subsequent joins or transformations.\nExample code - modified\n%python\r\ndf2=df1.groupBy('column_1').agg(F.array_sort(F.collect_list('column_2')).alias('sorted_collect_list_output'))\nPreventative measures\nEnsure your rows and transformations are deterministic by implementing changes in your code based on your specific use case, such as ordering the rows, especially if you also use other non-deterministic functions, such as\ncollect_set()\n,\nfirst()\n,\nlast()\nand window functions like\nrow_number()\nwith duplicate ordering keys.\nThe solution provided in this article is scoped to\ncollect_list\n." +} \ No newline at end of file diff --git a/scraped_kb_articles/using-datetime-values-in-spark-30-and-above.json b/scraped_kb_articles/using-datetime-values-in-spark-30-and-above.json new file mode 100644 index 0000000000000000000000000000000000000000..0c4512105ca0207c3a24baa23d1d91143e83d743 --- /dev/null +++ b/scraped_kb_articles/using-datetime-values-in-spark-30-and-above.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/sql/using-datetime-values-in-spark-30-and-above", + "title": "Unknown Article Title", + "content": "Problem\nYou are migrating jobs from unsupported clusters running Databricks Runtime 6.6 and below with Apache Spark 2.4.5 and below to clusters running a current version of the Databricks Runtime.\nIf your jobs and/or notebooks process date conversions, they may fail with a\nSparkUpgradeException\nerror message after running them on upgraded clusters.\nError in SQL statement: SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'YYYY-MM-DD' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from\nhttps://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html\nCause\nPrior to Spark 3.0, Spark used a combination of the Julian and Gregorian calendars. For dates before 1582, Spark used the Julian calendar. For dates after 1582, Spark used the Gregorian calendar.\nIn Spark 3.0 and above, Spark uses the Proleptic Gregorian calendar. This calendar is also used by other systems such as Apache Arrow, Pandas, and R.\nThe calendar usage is inherited from the legacy\njava.sql.Date\nAPI, which was superseded in Java 8 by\njava.time.LocalDate\nand uses the Proleptic Gregorian calendar.\nSolution\nYou should update your DateTime references so they are compatible with Spark 3.0 and above.\nFor example, if you try to parse a date in the format\nYYYY-MM-DD\n, it returns an error in Spark 3.0 and above.\nselect TO_DATE('2017-01-01', 'YYYY-MM-DD') as date\nUsing the format\nyyyy-MM-dd\nworks correctly in Spark 3.0 and above.\nselect TO_DATE('2017-01-01', 'yyyy-MM-dd') as date\nThe difference in capitalization may appear minor, but to Spark,\nD\nreferences the day-of-year, while\nd\nreferences the day-of-month when used in a DateTime function.\nReview all of the defined Spark\nDateTime patterns for formatting and parsing\nfor more details.\nInfo\nIf you want to temporarily revert to Spark 2.x DateTime formatting, you can set\nspark.sql.legacy.timeParserPolicy\nto\nLEGACY\nin a notebook. You can also set this value in the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nWhile this option works, it is only recommended as a temporary workaround." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/using-dbutils-inside-a-udf-to-retrieve-credentials-fails-with-permissions-error.json b/scraped_kb_articles/using-dbutils-inside-a-udf-to-retrieve-credentials-fails-with-permissions-error.json new file mode 100644 index 0000000000000000000000000000000000000000..15482dcc2ca2ecdcfc5f641e7d7ab1b6d384a176 --- /dev/null +++ b/scraped_kb_articles/using-dbutils-inside-a-udf-to-retrieve-credentials-fails-with-permissions-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/notebooks/using-dbutils-inside-a-udf-to-retrieve-credentials-fails-with-permissions-error", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to use\ndbutils\ninside a User-Defined Function (UDF) to retrieve credentials from a Databricks scope or secrets.\nYou receive a “permission denied” or “execution of function failed” error indicating a permissions issue, even though you have access to the secrets.\nPermissionError: [Errno 13] Permission denied\nOr\n[UDF_USER_CODE_ERROR.GENERIC] Execution of function ..dbutils_test() failed.\r\n== Error ==\r\nValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.\nWhen you review the stack trace, you see the following detail.\n== Stacktrace ==\r\nFile \"\", line 3, in main\r\nfrom databricks.sdk.runtime import spark\r\nFile \"/databricks/python3/lib/python3.12/site-packages/databricks/sdk/runtime/__init__.py\", line 172, in \r\ndbutils = RemoteDbUtils()\r\n^^^^^^^^^^^^^^^\r\nFile \"/databricks/python3/lib/python3.12/site-packages/databricks/sdk/dbutils.py\", line 194, in __init__\r\nself._config = Config() if not config else config\r\n^^^^^^^^\r\nFile \"/databricks/python3/lib/python3.12/site-packages/databricks/sdk/config.py\", line 127, in __init__\r\nraise ValueError(message) 
from e SQLSTATE: 39000\nYou notice that the UDF works if you use hard-coded credentials.\nCause\nYou cannot use\ndbutils\ninside a UDF because the UDF runs on an Apache Spark worker node. Attempting to use\ndbutils\ninside a UDF causes the code to fail to execute, resulting in a permission denied error.\nSolution\nFetch secrets on the driver before invoking the UDF, and pass them as function arguments. This pattern ensures the UDF receives credentials securely without directly calling\ndbutils\ninside worker-executed code.\nRetrieve database credentials outside the UDF on the driver\nCreate a secret scope. Follow the\nTutorial: Create and use a Databricks secret\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nRun the following code in a notebook to retrieve the database-related secrets saved in your Databricks secret scope.\n%python\r\n\r\ndatabase_username = dbutils.secrets.get(scope=\"scope-database\", key=\"database-username\")\r\n\r\ndatabase_password = dbutils.secrets.get(scope=\"scope-database\", key=\"database-password\")\r\n\r\ndatabase_name = dbutils.secrets.get(scope=\"scope-database\", key=\"database_name\")\r\n\r\ndatabase_host = dbutils.secrets.get(scope=\"scope-database\", key=\"database-host\")\nCreate a sample SQL UDF with arguments\nNote\nThis step will look different depending on your database connector library. This article uses PyMongo as an example. For other common database Python connector libraries, refer to the following documentation.\nOracle:\nConnecting to Oracle Database\nRedshift:\nExamples of using the Amazon Redshift Python connector\nPostgreSQL:\nBasic module usage\nEnsure your database Python connector package is already installed on the cluster. 
You can run\n%pip install \n(for example,\npymongo\n) if not.\nCreate the SQL UDF using the database credentials you created and fetched in the previous section.\n%sql\r\n\r\nCREATE OR REPLACE FUNCTION ..(\r\ndatabase_username STRING,\r\n \tdatabase_password STRING,\r\n \tdatabase_name STRING,\r\n \tdatabase_host STRING\r\n)\r\nRETURNS STRING\r\nLANGUAGE PYTHON\r\nAS $$\r\n\r\nimport pymongo\r\n\r\ndef main(database_username, database_password, database_name, database_host):\r\n    # Connect to the database\r\n    uri = f\"mongodb+srv://{database_username}:{database_password}@{database_host}/{database_name}?retryWrites=true&w=majority\"\r\n    client = pymongo.MongoClient(uri)\r\n\r\n    # Access collection and return value from first document\r\n    doc = client[database_name][\"your_collection\"].find_one({}, {\"your_field\": 1, \"_id\": 0})\r\n    return doc.get(\"your_field\") if doc else None\r\n\r\n$$;\nUse the UDF in a SQL query\nOnce defined, you can call the UDF from any SQL query, passing arguments dynamically using Python or manually in SQL.\n%python\r\n\r\nsql_query = f\"\"\"\r\nSELECT\r\n  ..(\r\n    '{database_username}',\r\n    '{database_password}',\r\n    '{database_name}',\r\n    '{database_host}'\r\n  ) AS value\r\n\"\"\"\r\ndf = spark.sql(sql_query)\r\ndf.display()\nFor more info on SQL UDFs, refer to the\nCREATE FUNCTION (SQL and Python)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/using-double-quotes-in-a-query-causes-a-syntax-error.json b/scraped_kb_articles/using-double-quotes-in-a-query-causes-a-syntax-error.json new file mode 100644 index 0000000000000000000000000000000000000000..d0546ab4d8add74f95530421ee2933a6576e1924 --- /dev/null +++ b/scraped_kb_articles/using-double-quotes-in-a-query-causes-a-syntax-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/using-double-quotes-in-a-query-causes-a-syntax-error", + "title": "Unknown Article Title", + "content": "Problem\nYou're trying to execute a SQL query with a column alias using double quotes, such as the following.\nSELECT 1 AS \"a\"\nThe query results in a syntax error.\n[PARSE_SYNTAX_ERROR] Syntax error at or near '\"a\"'. SQLSTATE: 42601\nCause\nBy default, Apache Spark does not allow the use of double quotes for identifiers in SQL queries. Spark interprets double quotes as string literals instead of identifiers, resulting in a syntax error.\nSolution\nEnable the\ndoubleQuotedIdentifiers\nsetting in your Spark configuration. At the same time, set\nspark.sql.ansi.enabled\nto\ntrue\n. The\ndoubleQuotedIdentifiers\nsetting requires ANSI mode to be enabled. For details, refer to the\nSpark Configuration\ndocumentation.\nYou can set the options at the notebook level. 
Run the following commands in a new cell at the top of your workload.\n%python\r\nspark.conf.set(\"spark.sql.ansi.enabled\", \"true\")\r\nspark.conf.set(\"spark.sql.doubleQuotedIdentifiers\", \"true\")" +} \ No newline at end of file diff --git a/scraped_kb_articles/using-glob-patterns-for-directory-filtering-impacting-auto-loader-performance.json b/scraped_kb_articles/using-glob-patterns-for-directory-filtering-impacting-auto-loader-performance.json new file mode 100644 index 0000000000000000000000000000000000000000..b9333d131ab7180cabe3931f5a72fbb4d807b68a --- /dev/null +++ b/scraped_kb_articles/using-glob-patterns-for-directory-filtering-impacting-auto-loader-performance.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/using-glob-patterns-for-directory-filtering-impacting-auto-loader-performance", + "title": "Unknown Article Title", + "content": "Problem\nAlthough using glob patterns to filter directories during file discovery in Auto Loader is a powerful tool, in certain cases using glob patterns has a significant impact on Auto Loader performance, especially when using partitions.\nCause\nWhile glob patterns help define which directories to include, they don’t limit Auto Loader's initial file discovery scan. Auto Loader still evaluates all subdirectories under the specified root. 
The glob pattern acts as a filter after the scan, determining which files or directories are processed further.\nExample\nThe pattern\n/mnt/my_table/{year=2025/month=1/day=2,year=2025/month=1/day=3}\nwill cause the Auto Loader to scan all partitions and sub partitions even though only 2 days of data is desired.\nSolution\nUse a more specific root path to reduce the scope of the initial scan.\nExample\nInstead of\n/mnt/my_table/{year=2025/month=1/day=2,year=2025/month=1/day=3}\n, use separate paths.\n/mnt/my_table/year=2025/month=1/{day=2,day=3}" +} \ No newline at end of file diff --git a/scraped_kb_articles/using-like-statement-causing-slower-performance-in-lakehouse-federation-query.json b/scraped_kb_articles/using-like-statement-causing-slower-performance-in-lakehouse-federation-query.json new file mode 100644 index 0000000000000000000000000000000000000000..16e9e19803f2748d15c6d24b618f1816778c853c --- /dev/null +++ b/scraped_kb_articles/using-like-statement-causing-slower-performance-in-lakehouse-federation-query.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/using-like-statement-causing-slower-performance-in-lakehouse-federation-query", + "title": "Unknown Article Title", + "content": "Problem\nWhen using Lakehouse Federation to run a query that uses a\nLIKE\nstatement and references tables that are connected to Lakehouse Federation, you notice slower performance compared to executing the query directly on the underlying database.\nCause\nIn Lakehouse Federation, the\nLIKE\ncommand is not passed as a pushdown filter. 
This means the underlying query runs more slowly than running the statement directly on the target system.\nSolution\nReplace\nLIKE\nstatements with alternative filter options when performing pushdowns in Lakehouse Federation.\nCONTAINS\nto search for a specific string within a column.\nSTARTSWITH\nto search for a specific string at the beginning of a column.\nENDSWITH\nto search for a specific string at the end of a column." +} \ No newline at end of file diff --git a/scraped_kb_articles/using-mlflow-api-call-to-load-a-model-taking-the-same-amount-of-time-every-call-and-artifacts-downloading-from-scratch.json b/scraped_kb_articles/using-mlflow-api-call-to-load-a-model-taking-the-same-amount-of-time-every-call-and-artifacts-downloading-from-scratch.json new file mode 100644 index 0000000000000000000000000000000000000000..40688103e15007cc3acb04da7b77603960ff029d --- /dev/null +++ b/scraped_kb_articles/using-mlflow-api-call-to-load-a-model-taking-the-same-amount-of-time-every-call-and-artifacts-downloading-from-scratch.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/using-mlflow-api-call-to-load-a-model-taking-the-same-amount-of-time-every-call-and-artifacts-downloading-from-scratch", + "title": "Unknown Article Title", + "content": "Problem\nWhen you load a model using the MLflow API call\nmlflow..load_model()\nrepeatedly, you notice the calls keep taking the same amount of time for each call, and artifacts are downloaded from scratch each time.\nCause\nBy default, MLflow retrieves and downloads model artifacts from the respective storage location each time they are called.\nSolution\nFirst, use\nmlflow.artifacts.download_artifacts()\nto save the model artifacts locally.\nimport mlflow\r\nmodel_uri = f\"models:/{}/{}\"\r\ndestination_path = \"/local_disk0/model\"\r\nmlflow.artifacts.download_artifacts(artifact_uri=model_uri,dst_path=destination_path)\nThen, load the model from the local path 
using\nmlflow..load_model()\ninstead, allowing faster subsequent loads from local storage.\nmodel_uri = \"/local_disk0/model\"\r\nmlflow..load_model(model_uri)\nFor more information, refer to the MLflow API\nmlflow.artifacts\ndocumentation." +} \ No newline at end of file diff --git a/scraped_kb_articles/using-parse_json-and-explode-to-flatten-data-returns-datatype_mismatchunexpected_input_type-error.json b/scraped_kb_articles/using-parse_json-and-explode-to-flatten-data-returns-datatype_mismatchunexpected_input_type-error.json new file mode 100644 index 0000000000000000000000000000000000000000..59a24fa5c555364cbf03392bffb2956a5ec752b7 --- /dev/null +++ b/scraped_kb_articles/using-parse_json-and-explode-to-flatten-data-returns-datatype_mismatchunexpected_input_type-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/using-parse_json-and-explode-to-flatten-data-returns-datatype_mismatchunexpected_input_type-error", + "title": "Unknown Article Title", + "content": "Problem\nYou parse a JSON string using\nparse_json\nand attempt to flatten the ARRAY using the\nexplode\nfunction. The following image shows a notebook cell with an example.\nThe code returns the following error where\n$.Company.Department\nare the example object and property names.\n[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve \"explode(variant_get(json_data, $.Company.Department))\" due to data type mismatch: The first parameter requires the (\"ARRAY\" or \"MAP\") type, however \"variant_get(json_data, $.Company.Department)\" has the type \"VARIANT\". SQLSTATE: 42K09\nCause\nThe\nparse_json\nfunction returns a\nVARIANT\ntype, but the\nexplode\nfunction requires an\nARRAY\ntype as a parameter.\nSolution\nUse the\nvariant_get\nfunction\n(\nAWS\n|\nAzure\n|\nGCP\n) or\nvariant_explode\ntable-valued function\n(\nAWS\n|\nAzure\n|\nGCP\n) to cast the payload to an\nARRAY\ntype. 
Then use the\nexplode\nfunction to flatten it.\nExample using\nvariant_get\nReplace the example JSON object with your own object, and adjust\n$.Company.Department'\nto your own object and property names.\nwith all_data as (\r\n  select \r\n    variant_get(\r\n      parse_json(\r\n        '{\"Company\":{\"Department\":[{\"@id\":\"001\",\"@name\":\"Sales\",\"Employee\":[{\"@id\":\"E001\",\"Contact\":{\"Email\":\"johndoe@example.com\",\"Phone\":{\"#text\":\"555-1234\",\"@type\":\"work\"}},\"Name\":\"John Doe\",\"Projects\":{\"Project\":  [{\"@code\":\"P001\",\"Budget\":{\"#text\":\"100000\",\"@currency\":\"USD\"},\"Title\":\"Project Alpha\"},{\"@code\":\"P002\",\"Budget\":{\"#text\":\"75000\",\"@currency\":\"USD\"},\"Title\":\"Project Beta\"}]},\"Role\":\"Manager\"},{\"@id\":\"E002\",\"Contact\":{\"Email\":\"janesmith@example.com\",\"Phone\":{\"#text\":\"555-9876\",\"@type\":\"mobile\"}},\"Name\":\"Jane Smith\",\"Role\":\"Sales Representative\",\"Sales\":{\"Region\":{\"@name\":\"North\",\"Achieved\":\"450000\",\"Target\":\"500000\"}}}]},{\"@id\":\"002\",\"@name\":\"Engineering\",\"Team\":[{\"@name\":\"Development\",\"Employee\":[{\"@id\":\"E003\",\"Name\":\"Alice Johnson\",\"Role\":\"Software Engineer\",\"Skills\":{\"Skill\":[{\"#text\":\"Python\",\"@level\":\"advanced\"},{\"#text\":\"Java\",\"@level\":\"intermediate\"},{\"#text\":\"Go\",\"@level\":\"beginner\"}]}},{\"@id\":\"E004\",\"Name\":\"Bob Lee\",\"Role\":\"DevOps Engineer\",\"Tools\":{\"Tool\":[\"Docker\",\"Kubernetes\"]}}]},{\"@name\":\"QA\",\"Employee\":{\"@id\":\"E005\",\"Name\":\"Charlie Brown\",\"Responsibilities\":{\"Responsibility\":[\"Automated Testing\",\"Manual Testing\"]},\"Role\":\"QA Analyst\"}}]}]}}'\r\n      ), \r\n      '$.Company.Department'\r\n    ) as json_data\r\n) \r\nselect \r\n  explode(\r\n    CAST(json_data AS ARRAY < VARIANT >)\r\n  ) as dept \r\nfrom \r\n  all_data\nThe following image shows the example code’s output in a notebook cell.\nExample using\nvariant_explode\nReplace the 
example JSON object with your own object, and adjust\n$.Company.Department'\nto your own object and property names.\nwith all_data as (\r\n  select \r\n    parse_json(\r\n   '{\"Company\":{\"Department\":[{\"@id\":\"001\",\"@name\":\"Sales\",\"Employee\":[{\"@id\":\"E001\",\"Contact\":{\"Email\":\"johndoe@example.com\",\"Phone\":{\"#text\":\"555-1234\",\"@type\":\"work\"}},\"Name\":\"John Doe\",\"Projects\":{\"Project\":[{\"@code\":\"P001\",\"Budget\":{\"#text\":\"100000\",\"@currency\":\"USD\"},\"Title\":\"Project Alpha\"},{\"@code\":\"P002\",\"Budget\":{\"#text\":\"75000\",\"@currency\":\"USD\"},\"Title\":\"Project Beta\"}]},\"Role\":\"Manager\"},{\"@id\":\"E002\",\"Contact\":{\"Email\":\"janesmith@example.com\",\"Phone\":{\"#text\":\"555-9876\",\"@type\":\"mobile\"}},\"Name\":\"Jane Smith\",\"Role\":\"Sales Representative\",\"Sales\":{\"Region\":{\"@name\":\"North\",\"Achieved\":\"450000\",\"Target\":\"500000\"}}}]},{\"@id\":\"002\",\"@name\":\"Engineering\",\"Team\":[{\"@name\":\"Development\",\"Employee\":[{\"@id\":\"E003\",\"Name\":\"Alice Johnson\",\"Role\":\"Software Engineer\",\"Skills\":{\"Skill\":[{\"#text\":\"Python\",\"@level\":\"advanced\"},{\"#text\":\"Java\",\"@level\":\"intermediate\"},{\"#text\":\"Go\",\"@level\":\"beginner\"}]}},{\"@id\":\"E004\",\"Name\":\"Bob Lee\",\"Role\":\"DevOps Engineer\",\"Tools\":{\"Tool\":[\"Docker\",\"Kubernetes\"]}}]},{\"@name\":\"QA\",\"Employee\":{\"@id\":\"E005\",\"Name\":\"Charlie Brown\",\"Responsibilities\":{\"Responsibility\":[\"Automated Testing\",\"Manual Testing\"]},\"Role\":\"QA Analyst\"}}]}]}}'\r\n    ) as json_data\r\n) \r\nselect \r\n  t.value \r\nfrom \r\n  all_data \r\n  join lateral variant_explode(\r\n    variant_get(\r\n      json_data, '$.Company.Department'\r\n    )\r\n  ) as t\nThe following image shows the example code’s output in a notebook cell." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/using-percentile-to-work-with-large-datasets-causing-memory-issues-or-oom-errors.json b/scraped_kb_articles/using-percentile-to-work-with-large-datasets-causing-memory-issues-or-oom-errors.json new file mode 100644 index 0000000000000000000000000000000000000000..c3b4638bf6884088581400aa2b5e81c8bccc004b --- /dev/null +++ b/scraped_kb_articles/using-percentile-to-work-with-large-datasets-causing-memory-issues-or-oom-errors.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/using-percentile-to-work-with-large-datasets-causing-memory-issues-or-oom-errors", + "title": "Unknown Article Title", + "content": "Problem\nWhen you use the\npercentile()\naggregate expression in PySpark and Photon to work with large datasets or datasets with many distinct values, you notice severe memory issues, including out-of-memory (OOM) errors.\nExample stack trace\nPhoton failed to reserve 43.3 MiB for percentile-merge-expr memory pool, in task.\r\nMemory usage:\r\nTotal task memory (including non-Photon): 6.6 GiB\r\ntask: allocated 6.6 GiB, tracked 6.6 GiB, untracked allocated 3.2 GiB, peak 6.6 GiB\r\nBufferPool: allocated 128.0 MiB, tracked 128.0 MiB, untracked allocated 3.2 GiB, peak 128.0 MiB\r\nDataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B\r\nShuffleExchangeSourceNode(id=XXX, output_schema=[string, string, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, ... 105 more]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B\r\nGroupingAggNode(id=XXX, output_schema=[string, string, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, ... 
105 more]): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB\r\nGroupingAggregation(recursion_depth=0): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB\r\nGC'd var-len aggregates: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 3.3 GiB\nCause\nThe\npercentile()\nfunction causes memory issues in three ways.\nBoth Apache Spark and Photon buffer the entire dataset into memory as a single, large array of values. This can quickly exhaust memory resources, especially with high cardinality datasets.\nSpilling to disk is challenging with the current approach because splitting and spilling a single large array of values is complex and inefficient.\nSorting the buffered values adds further memory and computational overhead.\nSolution\nUse\napprox_percentile()\ninstead. The\napprox_percentile()\nfunction is an approximate alternative to\npercentile()\nwith a fixed and predictable memory footprint. It avoids buffering the entire dataset in memory and is suitable for large-scale datasets where exact percentile calculation may be unnecessary.\nNote\nSimply disabling Photon does not resolve the issue because Spark uses a similar algorithm for\npercentile()\n, which inherits the same limitations." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/using-pyspark-testing-library-assertdataframeequal-throws-outofmemoryerror.json b/scraped_kb_articles/using-pyspark-testing-library-assertdataframeequal-throws-outofmemoryerror.json new file mode 100644 index 0000000000000000000000000000000000000000..93e86d47344e384ed4678fa6bc62fe1043707e43 --- /dev/null +++ b/scraped_kb_articles/using-pyspark-testing-library-assertdataframeequal-throws-outofmemoryerror.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/python/using-pyspark-testing-library-assertdataframeequal-throws-outofmemoryerror", + "title": "Unknown Article Title", + "content": "Problem\nWhen working with the PySpark testing library\nassertDataFrameEqual\n, you expect\nassertDataFrameEqual\nto confirm DataFrame equivalence, or throw an assertion error if they are not equivalent and provide information on the differences.\nInstead, you encounter a\nDriverStoppedException\nor an\nOutOfMemoryError\n.\nExample\nIn a Databricks Runtime 15.4 LTS cluster the error message indicates an unresponsive Python kernel.\nfrom pyspark.testing.utils import assertDataFrameEqual\r\n\r\nassertDataFrameEqual(actual=spark.range(100_000_000), expected=spark.range(99_999_999))\r\n\r\nFatal error: The Python kernel is unresponsive.\nCause\nYou’re performing the same types of comparison tests on one or more DataFrames which are larger than available Driver memory.\nInternally, the\nassertDataFrameEqual\nfunction calls\n.collect()\non the actual and expected DataFrames.\nexpected_list = expected.collect()\r\nactual_list = actual.collect()\nCollect\nis a memory-intensive and driver-bound operation, and may lead to unexpected driver behavior such as\nOutOfMemoryError\nor\nDriverStoppedException\nerrors.\nSolution\nFirst, use\nassertSchemaEqual\nto verify that the schemas are equivalent before comparing the actual rows and columns in the DataFrames. 
If schemas are unequal, you may not need to evaluate the DataFrame contents given that they will be inherently different as a result of schema differences. For more information on\nassertDataFrameEqual\n, review the Databricks\nSimplify PySpark testing with DataFrame equality functions\nblog post.\nIf you need to consider the entirety of your DataFrames, ensure you have enough memory.\nMonitor driver memory usage and adjust cluster memory by increasing the size of the Driver VM.\nUse a DataFrame operation such as\ndf.subtract()\nto identify differences on larger DataFrames when 100% accuracy is required and the size of the data exceeds the driver memory limits.\nIf you are able, test on a subset of the DataFrames and reduce the overall memory footprint.\nUse\nassertDataFrameEqual\nwith a subset of data using\ndf.limit(n)\nor a sample using\ndf.sample(...)\nUse a subset of columns such as comparing on a unique/primary key only, using\ndf.select(...)" +} \ No newline at end of file diff --git a/scraped_kb_articles/using-terraform-to-rename-a-metastore-in-unity-catalog-destroys-and-recreates-it-instead.json b/scraped_kb_articles/using-terraform-to-rename-a-metastore-in-unity-catalog-destroys-and-recreates-it-instead.json new file mode 100644 index 0000000000000000000000000000000000000000..a7c7851f0fbb66c6b991330e1d1ac55259329ac2 --- /dev/null +++ b/scraped_kb_articles/using-terraform-to-rename-a-metastore-in-unity-catalog-destroys-and-recreates-it-instead.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/using-terraform-to-rename-a-metastore-in-unity-catalog-destroys-and-recreates-it-instead", + "title": "Unknown Article Title", + "content": "Problem\nYou want to use Terraform to rename a metastore in Unity Catalog. 
Instead of modifying the metastore, Terraform destroys and recreates it.\nCause\nTerraform detects this resource change as a new resource instead of a modification.\nSolution\nRename the metastore using REST API\nPerform a PATCH request to update the metastore name using the Databricks REST API.\ncurl --request PATCH \"https:///api/2.1/unity-catalog/metastores/\" \\\r\n\t--header \"Authorization: Bearer \" \\\r\n\t--data '{ \"name\": \"\" }'\nFor more information, consult the\nUpdate a metastore\nAPI documentation.\nUpdate the metastore settings using the Unity Catalog CLI\nIn AWS and Azure, you can alternatively utilize the Unity Catalog CLI to modify the metastore configuration. Make a JSON file called\nupdate-metastore.json\n, with the following content.\n{\r\n\t\"name\": \"\"\r\n}\nThen run the following command.\ndatabricks unity-catalog metastores update --id \\\r\n\t--json-file update-metastore.json\nFor more information, review the\nUnity Catalog CLI (legacy)\n(\nAWS\n|\nAzure\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/using-vertica-spark-connector-to-write-a-dataframe-to-external-database-fails-with-table_or_view_not_found-error.json b/scraped_kb_articles/using-vertica-spark-connector-to-write-a-dataframe-to-external-database-fails-with-table_or_view_not_found-error.json new file mode 100644 index 0000000000000000000000000000000000000000..d12bc8434437e3a387aeaf96753656c2ca80c7bc --- /dev/null +++ b/scraped_kb_articles/using-vertica-spark-connector-to-write-a-dataframe-to-external-database-fails-with-table_or_view_not_found-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/using-vertica-spark-connector-to-write-a-dataframe-to-external-database-fails-with-table_or_view_not_found-error", + "title": "Unknown Article Title", + "content": "Problem\nWhile using Unity Catalog, when you use the Vertica Spark Connector to write DataFrames to an external database, the process fails with the following error.\n[TABLE_OR_VIEW_NOT_FOUND] The table or view `vertica`.`` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01\nCause\nThe Vertica Spark Connector, a third-party library, operates in a way that conflicts with Unity Catalog’s (UC) three-level namespacing system. 
The library cannot be used with a UC-enabled cluster.\nDatabricks does not guarantee that Unity Catalog supports third-party libraries.\nSolution\nSwitch to a non-UC cluster, such as\nNo isolation shared\n.\nNavigate to the\nCompute\ntab in your Databricks workspace.\nSelect the desired cluster from the list of available clusters.\nClick the\nEdit\nbutton on the cluster details page.\nChange the\nAccess Mode\nby selecting an appropriate option (such as\nNo isolation shared\n) from the dropdown.\nClick\nSave\nto apply the updates and restart the cluster.\nVerify the updated access mode on the cluster details page." +} \ No newline at end of file diff --git a/scraped_kb_articles/vacuum-best-practices-on-delta-lake.json b/scraped_kb_articles/vacuum-best-practices-on-delta-lake.json new file mode 100644 index 0000000000000000000000000000000000000000..1b06abadedb7a4444a384ced79a302093ce9c110 --- /dev/null +++ b/scraped_kb_articles/vacuum-best-practices-on-delta-lake.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake", + "title": "Unknown Article Title", + "content": "Why use\nVACUUM\non Delta Lake?\nVACUUM\nis used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs.\nWhen you run\nVACUUM\non a Delta table it removes the following files from the underlying file system:\nAny data files that are not maintained by Delta Lake\nStale data files (files that are no longer referenced by a Delta table) that are older than 7 days\nInfo\nVACUUM\ndoes NOT remove directories that begin with an underscore, such as\n_delta_log\n.\nWhen should you run\nVACUUM\n?\nWhen you run\nVACUUM\nit removes stale data files. This does not impact regular work, but it can limit your ability to\ntime travel\n(\nAWS\n|\nAzure\n|\nGCP\n).\nThe default configuration for a Delta table allows you to time travel 30 days into the past. 
However, to do this, the underlying data files must be present.\nThe default configuration for\nVACUUM\ndeletes stale data files that are older than seven days. As a result, if you run\nVACUUM\nwith the default settings, you will only be able to time travel seven days into the past, from the time you run\nVACUUM\n.\nIf you do not need to time travel more than seven days into the past, you can\nVACUUM\non a daily basis.\nRunning\nVACUUM\ndaily helps keep storage costs in check, especially for larger tables. You can also run\nVACUUM\non-demand if you notice a sudden surge in the storage costs for a specific Delta table.\nIssues you may face with\nVACUUM\nNo progress update\n: You may not know how far the\nVACUUM\nhas completed, especially when\nVACUUM\nhas run for a long time. You may not know how many files have been successfully removed and how many files remain.\nPoor run performance\n:\nVACUUM\nruns for a long time, especially when tables are huge and/or when tables are a source for high frequency input streams.\nMitigate issues with\nVACUUM\nNo progress update\nIf\nVACUUM\ncompletes within an hour or two, there is no need to troubleshoot. However, if\nVACUUM\nruns for longer than two hours (this can happen on large tables when\nVACUUM\nhasn’t been run recently), you may want to check the progress. In this case you can run\nVACUUM\nwith the\nDRY RUN\noption before and after the actual\nVACUUM\nrun to monitor the performance of a specific\nVACUUM\nrun and to identify the number of files deleted.\nRun\nVACUUM DRY RUN\nto determine the number of files eligible for deletion. Replace\n\nwith the actual table path location.\n%python\r\n\r\nspark.sql(\"VACUUM delta.`` DRY RUN\")\nThe\nDRY RUN\noption tells\nVACUUM\nit should not delete any files. Instead,\nDRY RUN\nprints the number of files and directories that are safe to be deleted. 
The intention in this step is not to delete the files, but to know the number of files eligible for deletion.\nThe example\nDRY RUN\ncommand returns an output which tells us that there are\nx\nfiles and directories that are safe to be deleted.\nFound x files and directories in a total of y directories that are safe to delete.\nYou should record the number of files identified as safe to delete.\nRun\nVACUUM\n.\nCancel\nVACUUM\nafter one hour.\nRun\nVACUUM\nwith\nDRY RUN\nagain.\nThe second\nDRY RUN\ncommand identifies the number of outstanding files that can be safely deleted.\nSubtract the outstanding number of files (second\nDRY RUN\n) from the original number of files to get the number of files that were deleted.\nInfo\nYou can also review your storage bucket information in your cloud portal to identify the remaining number of files existing in the bucket, or the number of deletion requests issued, to determine how far the deletion has progressed.\nPoor run performance\nThis can be mitigated by following\nVACUUM\nbest practices.\nAvoid actions that hamper performance\nAvoid over-partitioned data folders\nOver-partitioned data can result in a lot of small files. You should avoid partitioning on a high cardinality column. When you over-partition data, even running\nOPTIMIZE\ncan have issues compacting small files, as compaction does not happen across partition directories.\nFile deletion speed is directly dependent on the number of files. Over-partitioning data can hamper the performance of\nVACUUM\n.\nInfo\nYou should partition on a low cardinality column and z-order on a high cardinality column.\nAvoid concurrent runs\nWhen running\nVACUUM\non a large table, avoid concurrent runs (including dry runs).\nAvoid running other operations on the same location to avoid file system level throttling. 
Other operations can compete for the same bandwidth.\nAvoid cloud versioning\nSince Delta Lake maintains version history, you should avoid using cloud version control mechanisms, like\nS3 versioning on AWS\n.\nUsing cloud version controls in addition to Delta Lake can result in additional storage costs and performance degradation.\nActions to improve performance\nEnable\nautoOptimize\n/\nautoCompaction\nRun\nOPTIMIZE\nto eliminate small files. When you combine\nOPTIMIZE\nwith regular\nVACUUM\nruns, you ensure the number of stale data files (and the associated storage cost) is minimized.\nReview the documentation on\nautoOptimize\nand\nautoCompaction\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nReview the documentation on\nOPTIMIZE\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nInfo\nBefore you modify table properties, you must ensure there are no active writes happening on the table.\nUse Databricks Runtime 10.4 LTS or above and additional driver cores (Azure and GCP only)\nOn Azure and GCP\nVACUUM\nperforms the deletion in parallel on the driver, when using Databricks Runtime 10.4 LTS or above. The higher the number of driver cores, the more the operation can be parallelized.\nInfo\nOn AWS deletes happen in batches and the process is single threaded. AWS uses a bulk delete API and deletes in batches of 1000, but it doesn’t use parallel threads. As a result, using a multi-core driver may not help on AWS.\nUse Databricks Runtime 11.1 or above on AWS\nDatabricks Runtime 11.1 and above set the checkpoint creation interval to 100, instead of 10. As a result, fewer checkpoint files are created. With fewer checkpoint files to index, listing the transaction log directory is faster. This reduces the delta log size and improves the\nVACUUM\nlisting time. 
It also decreases the checkpoint storage size.\nIf you are using Databricks Runtime 10.4 LTS on AWS and cannot update to a newer runtime, you can manually set the table property with\ndelta.checkpointInterval=100\n. This creates checkpoint files for every 100 commits, instead of every 10 commits.\n%sql\r\n\r\nalter table set tblproperties ('delta.checkpointInterval' = 100)\nInfo\nReducing the number of checkpoints on Databricks Runtime 10.4 LTS may degrade table query/read performance, though in most cases the difference should be negligible. Before you modify table properties, you must ensure there are no active writes happening on the table.\nUse compute optimized instances\nSince\nVACUUM\nis compute intensive, you should use compute optimized instances.\nOn AWS use\nC5 series\nworker types.\nOn Azure use\nF series\nworker types.\nOn GCP use\nC2 series\nworker types.\nUse auto-scaling clusters\nBefore performing file deletion, the\nVACUUM\ncommand lists the files. File listing happens in parallel by leveraging the workers in the cluster. Having more workers in the cluster can help with the initial listing of files. The higher the number of workers, the faster the file listing process.\nAdditional workers are NOT needed for file deletion. This is why you should use an auto-scaling cluster with multiple workers. Once the file listing completes, the cluster can scale down and use the driver for the file deletion. This saves cluster costs.\nReview the documentation on how to\nenable and configure autoscaling\n(\nAWS\n|\nAzure\n|\nGCP\n) for more information.\nSet a higher trigger frequency for streaming jobs\nUse a trigger frequency of 120 seconds or more for streaming jobs that write to Delta tables. 
You can adjust this based on your needs.\n// ProcessingTime trigger with 120 seconds micro-batch interval\r\n\r\ndf.writeStream\r\n  .format(\"console\")\r\n  .trigger(Trigger.ProcessingTime(\"120 seconds\"))\r\n  .start()\nThe higher the trigger frequency, the bigger the data files. The bigger the data files, the smaller the number of total files. The smaller the number of total files, the less time it takes to delete files. As a result, future\nVACUUM\nattempts run faster.\nReduce log retention\nIf you do not need to time travel far into the past, you can reduce log retention to seven days. This reduces the number of JSON files and thereby reduces the listing time. This also reduces the delta log size.\nThe\ndelta.logRetentionDuration\nproperty configures how long you can go back in time. The default value is 30 days. You need to use\nALTER TABLE\nto modify existing property values.\n%sql\r\n\r\nALTER TABLE \r\nSET TBLPROPERTIES ('delta.logRetentionDuration'='7 days')\nInfo\nBefore you modify table properties, you must ensure there are no active writes happening on the table.\nRun\nVACUUM\ndaily\nIf you reduce log retention to seven days (thereby limiting time travel to seven days) you can run\nVACUUM\non a daily basis.\nThis deletes stale data files that are older than seven days, every day. This is a good way to avoid stale data files and reduce your storage costs.\nWarning\nIf the Delta table is the source for a streaming query, and if the streaming query falls behind by more than seven days, then the streaming query will not be able to correctly read the table as it will be looking for data that has already been deleted. 
You should only run a daily\nVACUUM\nif you know that all queries will never ask for data that is more than seven days old.\nAfter testing and verification on a small table, you can schedule\nVACUUM\nto run every day via a job.\nSchedule\nVACUUM\nto run using a job cluster, instead of running it manually on all-purpose clusters, which may cost more.\nUse an auto-scaling cluster when configuring the job to save costs.\nSummary\nTo improve\nVACUUM\nperformance:\nAvoid over-partitioned directories\nAvoid concurrent runs (during\nVACUUM\n)\nAvoid enabling cloud storage file versioning\nIf you run a periodic\nOPTIMIZE\ncommand, enable\nautoCompaction\n/\nautoOptimize\non the delta table\nUse a current Databricks Runtime\nUse auto-scaling clusters with compute optimized worker types\nIn addition, if your application allows for it:\nIncrease the trigger frequency of any streaming jobs that write to your Delta table\nReduce the log retention duration of the Delta table\nPerform a periodic\nVACUUM\nThese additional steps further increase\nVACUUM\nperformance and can also help reduce storage costs." 
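The DRY RUN bookkeeping described above (record the count, run VACUUM, cancel, re-run DRY RUN, subtract) can be automated by parsing the "Found x files..." message. A minimal sketch, using made-up counts for illustration:

```python
import re

def parse_dry_run_count(message):
    """Extract the number of deletable files/directories from VACUUM DRY RUN output.

    Expects the message format shown in the article:
    'Found x files and directories in a total of y directories that are safe to delete.'
    """
    m = re.search(r"Found (\d+) files and directories", message)
    if m is None:
        raise ValueError(f"Unrecognized DRY RUN output: {message!r}")
    return int(m.group(1))

# Hypothetical outputs captured before and after a partial VACUUM run.
before = parse_dry_run_count(
    "Found 120000 files and directories in a total of 400 directories that are safe to delete.")
after = parse_dry_run_count(
    "Found 45000 files and directories in a total of 400 directories that are safe to delete.")

deleted = before - after
print(f"Deleted so far: {deleted} ({deleted / before:.0%} complete)")
```

In a notebook, the messages would come from `spark.sql("VACUUM delta.`<path>` DRY RUN")` output rather than the hard-coded strings used here.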
+} \ No newline at end of file diff --git a/scraped_kb_articles/vacuum-operations-not-performing-even-after-enabling-predictive-optimization.json b/scraped_kb_articles/vacuum-operations-not-performing-even-after-enabling-predictive-optimization.json new file mode 100644 index 0000000000000000000000000000000000000000..a44cba7cd3c6aea857ebe1ab8cdde27b2e29a9a0 --- /dev/null +++ b/scraped_kb_articles/vacuum-operations-not-performing-even-after-enabling-predictive-optimization.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/vacuum-operations-not-performing-even-after-enabling-predictive-optimization", + "title": "Unknown Article Title", + "content": "Problem\nPredictive optimization does not trigger\nVACUUM\neven though it is enabled and there are old unreferenced files present in the Delta table.\nCause\nIf the table with predictive optimization enabled has no active usage, predictive optimization does not consider the table or its previously deleted data.\nSolution\nCreate a new deletion operation to trigger\nVACUUM\nafter enabling predictive optimization.\nCommit any operation that could result in unreferenced files to the table, such as\nDELETE\nor\nMERGE\n. This creates unreferenced file information to pass to predictive optimization as eligible for\nVACUUM\n.\nWait for\ndeletedFileRetentionDuration\nto pass.\nCheck that\nVACUUM\nhas been executed.\nFor more information, please review the\nPredictive optimization for Delta Lake\n(\nAWS\n|\nAzure\n) documentation.\nNote\nOnce a table has usage,\nVACUUM\nis triggered seven days after predictive optimization is enabled." 
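The waiting step above can be sketched as a small date calculation: files unreferenced by a DELETE or MERGE only become VACUUM-eligible once the retention window has passed. This assumes the default `deletedFileRetentionDuration` of seven days; check your table's property before relying on that value:

```python
from datetime import datetime, timedelta

def earliest_vacuum_time(delete_commit_time, retention=timedelta(days=7)):
    """Earliest time the files unreferenced by a delete commit become VACUUM-eligible.

    retention models deletedFileRetentionDuration; 7 days is the Delta default,
    not necessarily what your table is configured with.
    """
    return delete_commit_time + retention

commit = datetime(2024, 7, 1, 12, 0)  # hypothetical DELETE/MERGE commit time
print(earliest_vacuum_time(commit))   # → 2024-07-08 12:00:00
```

Before that moment, neither a manual VACUUM nor predictive optimization will remove the files.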
+} \ No newline at end of file diff --git a/scraped_kb_articles/vector-search-index-contains-incorrect-number-of-rows-.json b/scraped_kb_articles/vector-search-index-contains-incorrect-number-of-rows-.json new file mode 100644 index 0000000000000000000000000000000000000000..d63df638f45659427cc9bc552ccbb7dd6a2fd38e --- /dev/null +++ b/scraped_kb_articles/vector-search-index-contains-incorrect-number-of-rows-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/vector-search-index-contains-incorrect-number-of-rows-", + "title": "Unknown Article Title", + "content": "Problem\nYou find your vector search index does not contain the expected number of rows.\nExample\nYou upload your data, housed in different spreadsheets, to a Unity Catalog Volume. The data is parsed out of each spreadsheet using LangChain and each individual record is then loaded to a Delta table with Apache Spark.\nYour Delta table contains 475 rows and two columns. However, when you create a new vector search index, the resulting index only contains six rows instead of the expected 475.\nCause\nThe vector search index requires a unique column for the primary key.\nSolution\nEnsure that the source Delta table has a pre-existing unique column or add a new unique column prior to index creation.\nFor more information on creating a vector search index, please review\nHow to create and query a Vector Search index\n(\nAWS\n|\nAzure\n) as well as\nMosaic AI Vector Search\n(\nAWS\n|\nAzure\n) documentation." 
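A quick way to catch this before creating the index is to check the candidate primary-key column for duplicates, since rows whose key collides are collapsed in the index. A minimal sketch over plain dicts (in practice you would run the equivalent check on the Delta table, e.g. with a GROUP BY/HAVING query):

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return {key_value: count} for every candidate primary-key value
    that appears more than once; empty dict means the column is unique."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"id": 1, "text": "alpha"},
    {"id": 2, "text": "beta"},
    {"id": 2, "text": "gamma"},  # colliding key: only one row survives in the index
]
print(duplicate_keys(rows, "id"))  # → {2: 2}
```

An empty result is the precondition the article asks for; a non-empty one explains a 475-row table yielding a far smaller index.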
+} \ No newline at end of file diff --git a/scraped_kb_articles/vector-search-index-does-not-sync-or-update-and-gives-error-execution_service_startup_failurestorage_permission_issue.json b/scraped_kb_articles/vector-search-index-does-not-sync-or-update-and-gives-error-execution_service_startup_failurestorage_permission_issue.json new file mode 100644 index 0000000000000000000000000000000000000000..261ac3bcfbf71cb1517724c4ccd38c91c65f4dde --- /dev/null +++ b/scraped_kb_articles/vector-search-index-does-not-sync-or-update-and-gives-error-execution_service_startup_failurestorage_permission_issue.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta-live-tables/vector-search-index-does-not-sync-or-update-and-gives-error-execution_service_startup_failurestorage_permission_issue", + "title": "Título do Artigo Desconhecido", + "content": "Problem\nWhen attempting to create a vector search index on your Delta tables, you notice the index will not sync or update. You receive an error in your DLT pipeline.\ncom.databricks.pipelines.common.CustomException: [DLT ERROR CODE: EXECUTION_SERVICE_STARTUP_FAILURE.STORAGE_PERMISSION_ISSUE]\r\nOperation failed: \"This request is not authorized to perform this operation.\"\r\n\r\nCaused by: com.databricks.pipelines.common.CustomException: [DLT ERROR CODE: EXECUTION_SERVICE_STARTUP_FAILURE.STORAGE_PERMISSION_ISSUE] Operation failed: \"This request is not authorized to perform this operation.\", 403, GET, 
https://xxxxxxxxx.dfs.core.windows.net/honinstallbase?upn=false&resource=filesystem&maxResults=5000&directory=igs/__unitystorage/schemas/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/tables/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/_delta_log&continuation=xxxxxxxxxxx=&timeout=90&recursive=false&st=2024-07-21T13:04:55Z&sv=2020-02-10&ske=2024-07-21T15:04:55Z&sig=XXXXX&sktid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&se=2024-07-21T14:28:39Z&sdd=6&skoid=xxxxxxxx-xxxx-xxxxxxxxxxxxxxxxxxxxxx&spr=https&sks=b&skt=2024-07-21T13:04:55Z&sp=rl&skv=2020-10-02&sr=d, AuthorizationFailure, \"This request is not authorized to perform this operation. RequestId:44ea689f-e01f-0076-7671-db2675000000 Time:2024-07-21T13:28:42.2309941Z\"\nCause\nEnabling an Azure Storage firewall on your storage account prevents serverless compute from accessing the account. Without access to the account, there is no access to the managed DLT pipeline.\nSolution\nUpdate your storage account firewall settings to allow the Databricks serverless compute subnets, then retrigger the DLT pipeline.\nFor instructions, please review the\nConfigure a firewall for serverless compute access\ndocumentation.\nNote\nAnytime you want to access a storage account from a DLT pipeline using serverless compute, and the account has a firewall enabled, you need to configure the firewall to allow serverless to access the account." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/vector-search-queries-with-the-percent-character-not-performing-partial-string-matches.json b/scraped_kb_articles/vector-search-queries-with-the-percent-character-not-performing-partial-string-matches.json new file mode 100644 index 0000000000000000000000000000000000000000..b5188eeba01375096eef44991ccf8bd2f4ba3f27 --- /dev/null +++ b/scraped_kb_articles/vector-search-queries-with-the-percent-character-not-performing-partial-string-matches.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/machine-learning/vector-search-queries-with-the-percent-character-not-performing-partial-string-matches", + "title": "Unknown Article Title", + "content": "Problem\nWhile querying a vector search index, the\nLIKE\noperator fails to perform partial-string matches, but does perform complete-string matches.\nCause\nIn vector search filters, the `%` character is treated as a literal character without any wildcard functionality. In SQL, the `%` character is used as a wildcard to match any sequence of characters.\nSolution\nWhen using the\nLIKE\noperator in vector search filters, specify the exact string you want to match. Do not use the `%` character as a wildcard.\nExample\nIf you want to match any string that contains \"value\" as a substring, use the following vector search filter.\n`{\"source\": \"value\"}`\nThis will match any string that contains \"value\" as a substring, regardless of whether it is at the beginning, middle, or end of the string." 
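The difference between the two semantics can be illustrated with a toy model. This is not the vector search implementation, just a sketch of the behavior the article describes: SQL treats `%` as a wildcard, while the vector search filter treats the pattern as a literal string matched against substrings:

```python
import re

def sql_like(value, pattern):
    """SQL LIKE semantics: % matches any sequence, _ matches one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value) is not None

def vector_filter_like(value, pattern):
    """Toy model of the filter behavior described above: the pattern is taken
    literally (no wildcards) and matched as a substring."""
    return pattern in value

assert sql_like("my_value_here", "%value%")            # SQL wildcard matches
assert not vector_filter_like("my_value_here", "%value%")  # literal '%' never matches
assert vector_filter_like("my_value_here", "value")    # plain substring does
print("ok")
```

This is why dropping the `%` characters, as the solution recommends, restores the intended matches.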
+} \ No newline at end of file diff --git a/scraped_kb_articles/verify-log4j-version.json b/scraped_kb_articles/verify-log4j-version.json new file mode 100644 index 0000000000000000000000000000000000000000..e02206b749cbafea2973ddf305182a05fcac8725 --- /dev/null +++ b/scraped_kb_articles/verify-log4j-version.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/verify-log4j-version", + "title": "Unknown Article Title", + "content": "Databricks recently published a blog on\nLog4j 2 Vulnerability (CVE-2021-44228) Research and Assessment\n. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable.\nIf you are using Log4j within your cluster (for example, if you are processing user-controlled strings through Log4j), your use may be potentially vulnerable to the exploit if you have installed, and are using, an affected version or have installed services that transitively depend on an affected version.\nThis article explains how to check your cluster for installed versions of Log4j 2 and how to upgrade those instances.\nWarning\nDISCLAIMER: The suggestions provided in this article reflect Databricks’s best understanding of the ways to make these determinations at this time. 
Because we do not control your code, we cannot guarantee that, if you fail to find Log4j by following these directions or using the suggested scanners, affected Log4j code is not present in your code.\nCheck to see if Log4j 2 is installed\nCheck for a manual install\nManually review the libraries installed on your cluster (\nAWS\n|\nAzure\n|\nGCP\n).\nIf you have explicitly installed a version of Log4j 2 via Maven, it is listed under\nLibraries\nin the cluster UI (\nAWS\n|\nAzure\n|\nGCP\n).\nScan the classpath\nScan your classpath to check for a version of Log4j 2.\nStart your cluster.\nAttach a notebook to your cluster.\nRun this code to scan your classpath:\n%scala\r\n\r\n{\r\n  import scala.util.{Try, Success, Failure}\r\n  import java.lang.ClassNotFoundException\r\n  Try(Class.forName(\"org.apache.logging.log4j.core.Logger\", false, this.getClass.getClassLoader)) match {\r\n    case Success(loggerCls) =>\r\n      Option(loggerCls.getPackage) match {\r\n          case Some(pkg) =>\r\n            println(s\"Version: ${pkg.getSpecificationTitle} ${pkg.getSpecificationVersion}\")\r\n          case None =>\r\n            println(\"Could not determine Log4J 2 version\")\r\n      }\r\n    case Failure(e: ClassNotFoundException) =>\r\n      println(\"Could not load Log4J 2 class\")\r\n    case Failure(e) =>\r\n      println(s\"Unexpected Error: $e\")\r\n      throw e\r\n  }\r\n}\nIf Log4j 2 is NOT PRESENT on your classpath, you see a result like this:\nCould not load Log4J 2 class\nIf Log4j 2 is PRESENT on your classpath, you should see a result like this, which includes the Log4j 2 version:\nVersion: Apache Log4j Core 2.15.0\nInfo\nThis method does not identify cases where Log4j classes are shaded or included transitively.\nScan all user installed jars\nLocate all of the user installed jar files on your cluster and run a scanner to check for vulnerable Log4j 2 versions.\nStart your cluster.\nAttach a notebook to your cluster.\nRun this code to identify 
the location of the jar files:\n%scala\r\n\r\nimport org.apache.spark._\r\n\r\nval sparkEnv = SparkEnv.get\r\nval field = SparkEnv.get.getClass.getDeclaredField(\"driverTmpDir\")\r\nfield.setAccessible(true)\r\nprintln(s\"Your jars are installed under ${field.get(sparkEnv).asInstanceOf[Option[String]].get}\\n\")\nThe code displays the location of your jar files.\nYour jars are installed under /local_disk0/spark-1a6be695-9318-463c-b966-256c32e3771c/userFiles-582ca64b-93c9-444c-85b8-7779bd2c5e52\nDownload the jar files to your local machine.\nRun a scanner like\nLogpresso\nto check for vulnerable Log4j 2 versions.\nWarning\nDISCLAIMER: The Logpresso scanner is open source software provided by a third party. Databricks makes no representations of any kind regarding the function or quality of Logpresso.\nUpgrade your Log4j 2 version\nUpgrade via cluster UI\nIf you manually installed Log4j 2 via the cluster UI and it is already version 2.17 or above, no action is required.\nIf you manually installed Log4j 2 via the cluster UI, and it is 2.16 or below, you should uninstall the library from the cluster (\nAWS\n|\nAzure\n|\nGCP\n) and install version 2.17 or above.\nInfo\nIf Log4j 2 is a transitive dependency for another library, upgrade the library that uses Log4j 2 to a secure version. You can also exclude the Log4j 2 package when pulling in an outdated library, and explicitly include a secure version of Log4j 2. This is not guaranteed to work.\nUpgrade via command line\nIf you have installed Log4j 2 via command line (or via SSH), use the same method to upgrade Log4j 2 to a secure version.\nUpgrade custom built jar\nIf you include Log4j 2 in a custom built jar, upgrade Log4j 2 to a secure version and rebuild your jar.\nRe-attach the updated jar to your cluster.\nRestart your cluster after upgrading\nRestart your cluster after upgrading Log4j 2." 
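Once the jars are downloaded locally, a first-pass triage can be done by filename before running a full scanner like Logpresso. This filename check is a sketch of that idea only; as the article's disclaimer notes, it cannot catch shaded or transitively included copies, and the version names below are examples:

```python
import re

SAFE_VERSION = (2, 17, 0)  # first release line the article treats as safe

def vulnerable_log4j_jars(jar_names):
    """Flag log4j-core jars whose filename version is below 2.17.

    Filename-based only: shaded, renamed, or embedded copies are not detected.
    """
    flagged = []
    for name in jar_names:
        m = re.match(r"log4j-core-(\d+)\.(\d+)\.(\d+)\.jar$", name)
        if m and tuple(int(g) for g in m.groups()) < SAFE_VERSION:
            flagged.append(name)
    return flagged

jars = ["log4j-core-2.14.1.jar", "log4j-core-2.17.1.jar", "commons-io-2.11.0.jar"]
print(vulnerable_log4j_jars(jars))  # → ['log4j-core-2.14.1.jar']
```

Anything this flags should be uninstalled and replaced with 2.17 or above, per the upgrade steps above.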
+} \ No newline at end of file diff --git a/scraped_kb_articles/verify-r-packages-installed-init.json b/scraped_kb_articles/verify-r-packages-installed-init.json new file mode 100644 index 0000000000000000000000000000000000000000..af876787ea4b040a166b0839d038a7e4068dc985 --- /dev/null +++ b/scraped_kb_articles/verify-r-packages-installed-init.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/r/verify-r-packages-installed-init", + "title": "Unknown Article Title", + "content": "When you configure R packages to install via an init script, it is possible for a package install to fail if dependencies are not installed.\nYou can use the R commands in a notebook to check that all of the packages installed correctly.\nInfo\nThis article does require you to provide a list of packages to check against.\nList installed packages\nMake a list of all R package names that you have listed in your init script or scripts.\nEnter the list of packages in this sample code.\n%r\r\n\r\nmy_packages <- list(\"\", \"\", \"\" )\r\nfind.package(my_packages, quiet=TRUE)\nThe output is a list of all installed packages.\nVerify the output against the input list to ensure that all packages were successfully installed.\nList packages that did not install\nMake a list of all R package names that you have listed in your init script or scripts.\nEnter the list of packages in this sample code.\n%r\r\n\r\nmy_packages <- c(\"\", \"\", \"\" )\r\nnot_installed <- my_packages[!(my_packages %in% installed.packages()[ , \"Package\"])]\r\nprint(not_installed)\nThe output is a list of all packages that failed to install.\nIf you have packages that are consistently failing to install, you should enable cluster log delivery and review the cluster logs for failures." 
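The `not_installed` computation above is a set difference between the expected and installed package lists. The same idea expressed in Python, for comparison, with hypothetical package names standing in for the R script's entries:

```python
def missing_packages(expected, installed):
    """Return expected packages absent from the installed set, preserving the
    order of the expected list (same idea as the R not_installed computation)."""
    installed_set = set(installed)
    return [p for p in expected if p not in installed_set]

expected = ["dplyr", "ggplot2", "data.table"]   # names from your init script
installed = ["ggplot2", "dplyr"]                # names reported as installed
print(missing_packages(expected, installed))    # → ['data.table']
```

An empty result means every package in your init script installed; anything else points you at the cluster logs.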
+} \ No newline at end of file diff --git a/scraped_kb_articles/virtualenv-creation-failure-due-to-setuptools-=-7100-.json b/scraped_kb_articles/virtualenv-creation-failure-due-to-setuptools-=-7100-.json new file mode 100644 index 0000000000000000000000000000000000000000..cffd5d575e9b26448d7a2b97f345b47c97bc9ff3 --- /dev/null +++ b/scraped_kb_articles/virtualenv-creation-failure-due-to-setuptools-=-7100-.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/virtualenv-creation-failure-due-to-setuptools-=-7100-", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try to execute a notebook with an interactive cluster or job cluster using Databricks Workflows, the cluster does not execute the notebook. In the cluster logs, you observe errors like:\n20/01/01 00:00:00 ERROR Utils: Process List(virtualenv, /local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, -p, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python, --no-download, --no-setuptools, --no-wheel) exited with code 1, and RuntimeError: failed to query /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python with code 1 err: 'Traceback (most recent call last):\\n  File \"/usr/local/lib/python3.10/dist-packages/virtualenv/discovery/py_info.py\", line 543, in \\n  info = PythonInfo()._to_json()\\n  File \"/usr/local/lib/python3.10/dist-packages/virtualenv/discovery/py_info.py\", line 90, in __init__\\n  self.distutils_install = {u(k): u(v) for k, v in self._distutils_install().items()}\\n  File \"/usr/local/lib/python3.10/dist-packages/virtualenv/discovery/py_info.py\", line 165, in _distutils_install\\n  i.finalize_options()\\n  File \"/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/setuptools/command/install.py\", line 57, in finalize_options\\n  super().finalize_options()\\n  File 
\"/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/setuptools/_distutils/command/install.py\", line 407, in finalize_options\\n  \\'dist_fullname\\': self.distribution.get_fullname(),\\n  File \"/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/setuptools/_core_metadata.py\", line 266, in get_fullname\\n  return _distribution_fullname(self.get_name(), self.get_version())\\n  File \"/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/setuptools/_core_metadata.py\", line 284, in _distribution_fullname\\n  canonicalize_version(version, strip_trailing_zero=False),\\nTypeError: canonicalize_version() got an unexpected keyword argument \\'strip_trailing_zero\\'\\n'\r\n20/01/01 00:00:00 ERROR VirtualenvCloneHelper: Encountered error during virtualenv creation org.apache.spark.SparkException: Process List(virtualenv, /local_disk0/.ephemeral_nfs/envs/pythonEnv-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX, -p, /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python, --no-download, --no-setuptools, --no-wheel) exited with code 1.\nCause\nThe\nsetuptools Python package\nmade a change in version 71.0.0, resulting in Databricks Runtime not being able to create a virtual environment.\nThe highest version of setuptools provided by Databricks Runtime is 68.0.0 from 15.3. This means that versions higher than 68.0.0 are provided by the user’s Python job, directly or indirectly as a dependency.\nSolution\nAs of July 17th (2024), the latest compatible version of setuptools with Databricks Runtime 9.1 LTS - 15.3 is 70.3.0. Please pin setuptools to version 70.3.0. 
You have three options.\nNotebook-level\nAt the start of your notebook, include the following line to pin the setuptools version to 70.3.0:\n%pip install setuptools==70.3.0\nFor more information, please review the\nNotebook-scoped Python libraries\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nCluster library UI\nFrom the Databricks navigation sidebar, navigate to\nCompute\n>\n(Your cluster)\n>\nLibraries\n. Here, you can pin setuptools at the cluster level by providing the following string:\nsetuptools==70.3.0\nFor more information, please review the\nCluster libraries\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\n(Global) init script\nInit scripts are executed during cluster start-up and can ensure that setuptools is pinned to the appropriate version. Inside the init script, you can add the following script to pin the setuptools version:\n#!/bin/bash\n/databricks/python/bin/pip install setuptools==70.3.0\nFor more information, please review the\nWhat are init scripts?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nBonus: Find the source of the unversioned setuptools installation\nIf you would like to understand the mechanics behind dependencies installing unpinned setuptools versions, you can use\npipdeptree\nto clarify the dependency resolution of all the dependencies on a cluster.\nInstall pipdeptree directly in a notebook:\n%pip install pipdeptree\nDisplay the dependency tree by executing the following command in a notebook:\n%sh pipdeptree\nFor more information, please review the\npipdeptree documentation\n." 
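To confirm whether a given environment is affected, you can compare the installed setuptools version against 71.0.0, the release the article identifies as breaking virtualenv creation. A minimal sketch with a deliberately tiny X.Y.Z parser (real tooling should use `packaging.version` instead):

```python
def parse_version(v):
    """Tiny X.Y.Z parser for illustration only; use packaging.version in practice."""
    return tuple(int(part) for part in v.split("."))

def needs_pin(installed_version, first_broken="71.0.0"):
    """True when the installed setuptools is at or past the release that broke
    virtualenv creation on these Databricks Runtime versions."""
    return parse_version(installed_version) >= parse_version(first_broken)

assert needs_pin("71.0.0")       # the breaking release itself
assert not needs_pin("70.3.0")   # the recommended pin is fine
print("pin check ok")
```

In a notebook you could feed it the live version, e.g. `needs_pin(importlib.metadata.version("setuptools"))`, and apply one of the three pinning options above when it returns True.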
+} \ No newline at end of file diff --git a/scraped_kb_articles/volume-log-cluster-configuration-detail-not-displayed-using-dbcli-or-sdk.json b/scraped_kb_articles/volume-log-cluster-configuration-detail-not-displayed-using-dbcli-or-sdk.json new file mode 100644 index 0000000000000000000000000000000000000000..d780ae6c82fdc32be9bd8ea26187110468e7380f --- /dev/null +++ b/scraped_kb_articles/volume-log-cluster-configuration-detail-not-displayed-using-dbcli-or-sdk.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dev-tools/volume-log-cluster-configuration-detail-not-displayed-using-dbcli-or-sdk", + "title": "Unknown Article Title", + "content": "Problem\nYou’re using the Databricks CLI with the\ndatabricks clusters get \nmethod or the SDK with the\nclusters.list()\nmethod to set a cluster log configuration to a volume path. You notice the\nClusterLogConf\nis displayed as empty\n{}\nor\nnone\n.\nCause\nYour Databricks CLI or SDK is out of date.\nSolution\nUpdate your developer tool library. Databricks recommends that you periodically install the latest available version of the Databricks CLI (\ndatabricks-cli 0.18.0\n) and Databricks SDK (\ndatabricks-sdk 0.57.0\n) from PyPI. These updates contain the latest feature logic and API improvements." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/wasb-check-blob-types.json b/scraped_kb_articles/wasb-check-blob-types.json new file mode 100644 index 0000000000000000000000000000000000000000..a62120a5508e4824a874cf9d50523df60cbad9d7 --- /dev/null +++ b/scraped_kb_articles/wasb-check-blob-types.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data-sources/wasb-check-blob-types", + "title": "Unknown Article Title", + "content": "Problem\nWhen you try reading a file on WASB with Spark, you get the following exception:\norg.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.\nWhen you try listing files in WASB using\ndbutils.fs.ls\nor the Hadoop API, you get the following exception:\njava.io.FileNotFoundException: File/ does not exist.\nCause\nThe WASB filesystem supports three types of blobs: block, page, and append.\nBlock blobs are optimized for upload of large blocks of data (the default in Hadoop).\nPage blobs are optimized for random read and write operations.\nAppend blobs are optimized for append operations.\nSee\nUnderstanding block blobs, append blobs, and page blobs\nfor details.\nThe errors described above occur if you try to read an append blob or list a directory that contains only append blobs. The Databricks and\nHadoop Azure\nWASB implementations do not support reading append blobs. Similarly, when listing a directory, append blobs are ignored.\nThere is no workaround to enable reading append blobs or listing a directory that contains only append blobs. 
However, you can use either the Azure CLI or the Azure Storage SDK for Python to identify whether a directory contains append blobs or a file is an append blob.\nYou can verify whether a directory contains append blobs by running the following Azure CLI command:\naz storage blob list \\\r\n  --auth-mode key \\\r\n  --account-name \\\r\n  --container-name \\\r\n  --prefix \nThe result is returned as a JSON document, in which you can easily find the blob type for each file.\nIf the directory is large, you can limit the number of results with the flag\n--num-results \n.\nYou can also use the Azure Storage SDK for Python to list and explore files in a WASB filesystem:\n%python\r\n\r\n# 'service' is a BlockBlobService client from the legacy azure-storage SDK.\r\nfrom azure.storage.blob import BlockBlobService\r\n\r\nservice = BlockBlobService(account_name=\"<account>\", account_key=\"<key>\")\r\nblobs = service.list_blobs(\"container\")\r\nfor blob in blobs:\r\n  if blob.properties.blob_type == \"AppendBlob\":\r\n    print(\"\\t Blob name: %s, %s\" % (blob.name, blob.properties.blob_type))\nDatabricks does support accessing append blobs using the Hadoop API, but only when appending to a file.\nSolution\nThere is no workaround for this issue.\nUse the Azure CLI or the Azure Storage SDK for Python to identify whether the directory contains append blobs or the object is an append blob.\nYou can implement either a Spark SQL UDF or a custom function using the RDD API to load, read, or convert blobs using the Azure Storage SDK for Python." 
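The JSON document returned by `az storage blob list` can be filtered programmatically to surface append blobs. A minimal sketch, where the field layout below is an abbreviated, assumed approximation of the CLI's output (the real document carries many more properties):

```python
import json

# Abbreviated, assumed approximation of `az storage blob list` JSON output.
listing = """[
  {"name": "logs/app.log",      "properties": {"blobType": "AppendBlob"}},
  {"name": "data/part.parquet", "properties": {"blobType": "BlockBlob"}}
]"""

# Keep only the blobs whose type would break Spark/WASB reads.
append_blobs = [
    b["name"]
    for b in json.loads(listing)
    if b["properties"]["blobType"] == "AppendBlob"
]
print(append_blobs)
```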
+} \ No newline at end of file diff --git a/scraped_kb_articles/who-deleted-cluster.json b/scraped_kb_articles/who-deleted-cluster.json new file mode 100644 index 0000000000000000000000000000000000000000..2aac6ab139ae066c716870c3f5f884e2d2e8a789 --- /dev/null +++ b/scraped_kb_articles/who-deleted-cluster.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/who-deleted-cluster", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/who-deleted-workspace.json b/scraped_kb_articles/who-deleted-workspace.json new file mode 100644 index 0000000000000000000000000000000000000000..bf0c63e7559b040221ab1fb1b4abeada7fb11145 --- /dev/null +++ b/scraped_kb_articles/who-deleted-workspace.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/administration/who-deleted-workspace", + "title": "Unknown Article Title", + "content": "" +} \ No newline at end of file diff --git a/scraped_kb_articles/withcolumn-operation-when-using-in-loop-slows-performance.json b/scraped_kb_articles/withcolumn-operation-when-using-in-loop-slows-performance.json new file mode 100644 index 0000000000000000000000000000000000000000..c0228f0e6afe222c2ec31792f6d5adb61e1dc81b --- /dev/null +++ b/scraped_kb_articles/withcolumn-operation-when-using-in-loop-slows-performance.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/scala/withcolumn-operation-when-using-in-loop-slows-performance", + "title": "Unknown Article Title", + "content": "Problem\nWhen using the Apache Spark\nwithColumn\noperation multiple times (such as in a loop), you notice slow performance or a\nStackOverflowException\n.\nCause\nEach\nwithColumn\ncall introduces a new projection internally, which generates large execution plans.\nSolution\nUse a select operation on multiple columns at once. 
A single select casts all columns to IntegerType more efficiently by performing all transformations in one projection.\nimport org.apache.spark.sql.types.IntegerType\r\n\r\nval df2 = df1.select(df1.columns.map { col =>\r\n  df1(col).cast(IntegerType)\r\n}: _*)\nFor more information, please review the Spark\nDataset\ndocumentation under the\nwithColumn\nsection." +} \ No newline at end of file diff --git a/scraped_kb_articles/workflows-are-failing-with-a-could-not-reach-driver-of-the-cluster-error.json b/scraped_kb_articles/workflows-are-failing-with-a-could-not-reach-driver-of-the-cluster-error.json new file mode 100644 index 0000000000000000000000000000000000000000..7456f2949e012cfd743f2c5e3396ca1ff3b5f3d4 --- /dev/null +++ b/scraped_kb_articles/workflows-are-failing-with-a-could-not-reach-driver-of-the-cluster-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/workflows-are-failing-with-a-could-not-reach-driver-of-the-cluster-error", + "title": "Unknown Article Title", + "content": "Problem\nYou have jobs and/or workflows that are failing with the error message\nCould not reach driver of the cluster\n.\nCause\nThe most common cause is high\nsystem.gc()\npauses on the driver node as well as high CPU and memory utilization, which leads to throttling and prevents the driver from responding within the allocated time. This can be caused by running multiple jobs concurrently on a single cluster.\nTo verify high\nsystem.gc()\npauses in the driver, you can review the logs.\nClick\nCompute\n.\nClick the name of your cluster.\nClick\nDriver logs\n.\nClick\nstdout\nto display the logs.\nReview the garbage collection log lines to see the time taken. 
This example log shows that the garbage collection time taken is very high and confirms high\nsystem.gc()\npauses as the cause.\n[25229.348s][info][gc     ] GC(33) Pause Young (System.gc()) 20386M->896M(213022M) 11.509ms\r\n[25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms\r\n[27029.347s][info][gc     ] GC(35) Pause Young (System.gc()) 20334M->919M(213004M) 10.893ms\r\n[27029.525s][info][gc     ] GC(36) Pause Full (System.gc()) 919M->707M(213004M) 177.894ms\nYou can verify if high CPU and memory utilization is the cause by reviewing the utilization metrics in the\ncompute metrics\n(\nAWS\n|\nAzure\n|\nGCP\n).\nIf you have high utilization, the graph shows CPU or memory usage above 85%.\nAnother common cause is when the default REPL timeout is too short for the specific workload, causing the kernel to fail to start within the allocated time.\nSolution\nIf the root cause is high\nsystem.gc()\npauses, high CPU utilization, or high memory utilization, you should use a larger driver instance to accommodate the increased resource requirements.\nIf the root cause is a too-short REPL timeout, you can increase it with a\ncluster-scoped init script\n(\nAWS\n|\nAzure\n|\nGCP\n).\nExample init script\nThis sample code creates an init script that sets the REPL timeout to 150 seconds, providing more time for the kernel to start. 
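Spotting explicit System.gc() pauses in driver logs of the format shown above can be automated. A minimal sketch (the regular expression is ours and assumes the unified JVM GC log format of the example lines):

```python
import re

# Matches lines like:
# [25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms
PAUSE_RE = re.compile(r"Pause (Young|Full) \(System\.gc\(\)\).*?([\d.]+)ms\s*$")

def gc_pauses(lines):
    """Yield (pause_kind, milliseconds) for explicit System.gc() pauses."""
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            yield m.group(1), float(m.group(2))

log = [
    "[25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms",
    "[27029.347s][info][gc     ] GC(35) Pause Young (System.gc()) 20334M->919M(213004M) 10.893ms",
]
print(list(gc_pauses(log)))
```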
It stores the init script as a workspace file.\nBefore running the sample code, replace\n\nwith the full path to the location in your workspace where you want to store the init script.\nInfo\nDatabricks Runtime 11.3 LTS and above is required to use init scripts stored as workspace files.\n%python\r\n\r\ninitScriptContent = \"\"\"\r\n#!/bin/bash\r\ncat > /databricks/common/conf/set_repl_timeout.conf << EOL\r\n{\r\n  databricks.daemon.driver.launchTimeout = 150\r\n}\r\nEOL\r\n\"\"\"\r\ndbutils.fs.put(\"/Workspace//set_repl_timeout.sh\", initScriptContent, True)\nBest practices\nAvoid running multiple jobs concurrently on a single cluster.\nRegularly monitor CPU, memory, and disk usage metrics to ensure that your clusters have sufficient resources to handle the workload. Adjust cluster configurations or scale up as needed.\nChoose driver and worker instance types that match your workload requirements. Consider using larger instances for resource-intensive workloads." +} \ No newline at end of file diff --git a/scraped_kb_articles/writestreamreadstream-leads-to-an-error-when-the-schema-contains-nulltype.json b/scraped_kb_articles/writestreamreadstream-leads-to-an-error-when-the-schema-contains-nulltype.json new file mode 100644 index 0000000000000000000000000000000000000000..8623c60f9fe17774e0156aa928a6d45ef977bfdd --- /dev/null +++ b/scraped_kb_articles/writestreamreadstream-leads-to-an-error-when-the-schema-contains-nulltype.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/streaming/writestreamreadstream-leads-to-an-error-when-the-schema-contains-nulltype", + "title": "Unknown Article Title", + "content": "Problem\nYou’re executing a Delta streaming task in Delta Lake and encounter an error.\nTypeError: cannot unpack non-iterable NoneType object.\nWhen reviewing the error stack, you see that your requested schema contains special columns that can only be produced by\nVectorizedParquetRecordReader\nand data types which that reader doesn’t 
support.\nCause\nParquet doesn't support\nNullType\n, and when you request to cast the\nNullType\ncolumn, you’re asking the Delta reader to read a column from a Parquet file. The code fails at the reading stage before reaching the casting operation.\nSolution\nDuring data reading, replace\nNullType\nwith a literal\nNone\nvalue and then cast it to\nStringType\n. This approach ensures that the\nNullType\ncolumn is replaced with a valid data type (\nStringType\n) so further processing can proceed.\nfrom pyspark.sql.functions import lit\r\nfrom pyspark.sql.types import StringType\r\n\r\nstreaming_df = streaming_df.withColumn('', lit(None).cast(StringType()))" +} \ No newline at end of file diff --git a/scraped_kb_articles/wrong-schema-in-files.json b/scraped_kb_articles/wrong-schema-in-files.json new file mode 100644 index 0000000000000000000000000000000000000000..8e84065de0fb94e53a8ccf05926ec52055b6b0ec --- /dev/null +++ b/scraped_kb_articles/wrong-schema-in-files.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/data/wrong-schema-in-files", + "title": "Unknown Article Title", + "content": "Problem\nThe Spark job fails with an exception like the following while reading Parquet files:\nError in SQL statement: SparkException: Job aborted due to stage failure:\r\nTask 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0\r\n(TID 868031, 10.111.245.219, executor 31):\r\njava.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary\r\n    at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)\nCause\nThe\njava.lang.UnsupportedOperationException\nin this instance is caused by one or more Parquet files written to a Parquet folder with an incompatible schema.\nSolution\nFind the Parquet files and rewrite them with the correct schema. 
Try to read the Parquet dataset with schema merging enabled:\n%scala\r\n\r\nspark.read.option(\"mergeSchema\", \"true\").parquet(path)\nor\n%scala\r\n\r\nspark.conf.set(\"spark.sql.parquet.mergeSchema\", \"true\")\r\nspark.read.parquet(path)\nIf you do have Parquet files with incompatible schemas, the snippets above will output an error with the name of the file that has the wrong schema.\nYou can also check if two schemas are compatible by using the\nmerge\nmethod. For example, let’s say you have these two schemas:\n%scala\r\n\r\nimport org.apache.spark.sql.types._\r\n\r\nval struct1 = (new StructType)\r\n  .add(\"a\", \"int\", true)\r\n  .add(\"b\", \"long\", false)\r\n\r\nval struct2 = (new StructType)\r\n  .add(\"a\", \"int\", true)\r\n  .add(\"b\", \"long\", false)\r\n  .add(\"c\", \"timestamp\", true)\nThen you can test if they are compatible:\n%scala\r\n\r\nstruct1.merge(struct2).treeString\nThis will give you:\n%scala\r\n\r\nres0: String =\r\n\"root\r\n|-- a: integer (nullable = true)\r\n|-- b: long (nullable = false)\r\n|-- c: timestamp (nullable = true)\r\n\"\nHowever, if\nstruct2\nhas the following incompatible schema:\n%scala\r\n\r\nval struct2 = (new StructType)\r\n  .add(\"a\", \"int\", true)\r\n  .add(\"b\", \"string\", false)\nThen the test will give you the following\nSparkException\n:\norg.apache.spark.SparkException: Failed to merge fields 'b' and 'b'. 
Failed to merge incompatible data types LongType and StringType" +} \ No newline at end of file diff --git a/scraped_kb_articles/xlsx-file-not-supported-xlrd.json b/scraped_kb_articles/xlsx-file-not-supported-xlrd.json new file mode 100644 index 0000000000000000000000000000000000000000..820b3b59fbb70ab7d16fd2607d20a5aa27487421 --- /dev/null +++ b/scraped_kb_articles/xlsx-file-not-supported-xlrd.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/libraries/xlsx-file-not-supported-xlrd", + "title": "Unknown Article Title", + "content": "Problem\nYou have\nxlrd\ninstalled on your cluster and are attempting to read files in the Excel .xlsx format when you get an error.\nXLRDError: Excel xlsx file; not supported\nCause\nxlrd\n2.0.0 and above can only read .xls files.\nSupport for .xlsx files was removed from\nxlrd\ndue to a potential security vulnerability.\nSolution\nUse\nopenpyxl\nto open .xlsx files instead of\nxlrd\n.\nInstall the\nopenpyxl\nlibrary on your cluster (\nAWS\n|\nAzure\n|\nGCP\n).\nConfirm that you are using\npandas\nversion 1.0.1 or above.\n%python\r\n\r\nimport pandas as pd\r\nprint(pd.__version__)\nSpecify\nopenpyxl\nwhen reading .xlsx files with\npandas\n.\n%python\r\n\r\nimport pandas\r\ndf = pandas.read_excel(\".xlsx\", engine=\"openpyxl\")\nRefer to the\nopenpyxl documentation\nfor more information." 
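The engine choice described in the xlsx article above can be wrapped in a small helper. A sketch for illustration only (the helper name and mapping are ours, not a pandas API): xlrd 2.0.0 and above reads .xls exclusively, while openpyxl handles .xlsx.

```python
def excel_engine(path: str) -> str:
    """Pick a pandas read_excel engine by file extension.

    Illustrative helper, not part of pandas: maps .xlsx to openpyxl
    and .xls to xlrd, per the compatibility notes above.
    """
    lowered = path.lower()
    if lowered.endswith(".xlsx"):
        return "openpyxl"
    if lowered.endswith(".xls"):
        return "xlrd"
    raise ValueError(f"unsupported Excel extension: {path}")

print(excel_engine("report.xlsx"))
```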
+} \ No newline at end of file diff --git a/scraped_kb_articles/xml-file-read-executes-slowly.json b/scraped_kb_articles/xml-file-read-executes-slowly.json new file mode 100644 index 0000000000000000000000000000000000000000..99dee40098396d3a80d5b5c77f3d6a3e565b5ec8 --- /dev/null +++ b/scraped_kb_articles/xml-file-read-executes-slowly.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/clusters/xml-file-read-executes-slowly", + "title": "Unknown Article Title", + "content": "Problem\nWhile reading a large XML file from any storage location in Databricks, you notice slow performance.\nCause\nNo schema is provided. Without a provided schema, Apache Spark must scan the file to infer the schema.\nWith XML files, schema inference is not splittable and scanning is not parallelizable. Processing the entire file and reading it are conducted as a single task, which takes longer.\nSolution\nDefine the schema explicitly when reading XML files by adding it to the read operation parameters. For example,\nschema(\"content STRING,item_id long\")\n.\nAlternatively, use Auto Loader for file ingestion, which caches the schema after the first inference to avoid repeated overhead. For more information, refer to the “Schema inference and evolution in Auto Loader” section of the\nRead and write XML files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/you-do-not-use-deletion-vectors-but-see-a-file-named-deletion-vector-in-your-data-path.json b/scraped_kb_articles/you-do-not-use-deletion-vectors-but-see-a-file-named-deletion-vector-in-your-data-path.json new file mode 100644 index 0000000000000000000000000000000000000000..8b1d3b53ca2804620af5130316b256aed21f435f --- /dev/null +++ b/scraped_kb_articles/you-do-not-use-deletion-vectors-but-see-a-file-named-deletion-vector-in-your-data-path.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/you-do-not-use-deletion-vectors-but-see-a-file-named-deletion-vector-in-your-data-path", + "title": "Unknown Article Title", + "content": "Problem\nYou are reviewing your data path and you come across a file with \"deletion vector\" in its name and a .bin extension. Because of the name, you may assume it is a Delta Lake\ndeletion vector\n(\nAWS\n|\nAzure\n|\nGCP\n), but this would be incorrect. This file can be created even if you do not use Delta Lake deletion vectors.\ndbutils.fs.ls(\"s3://bucket_name/table_name/\")\r\n\r\n[FileInfo(path='s3://bucket_name/table_name/deletion_vector_1112222-33333-44444-5555-123455.bin', name='deletion_vector_1112222-33333-44444-5555-123455.bin', size=1024),\r\n FileInfo(path='s3://bucket_name/table_name/date=delta_log', name='date=delta_log/', size=0),\r\n FileInfo(path='s3://bucket_name/table_name/date=20241010', name='date=20241010/', size=0),\r\n ]\nCause\nThis file is generated by an internal process called\nlow shuffle merge\n(\nAWS\n|\nAzure\n|\nGCP\n). Introduced in Databricks Runtime 10.4 LTS, low shuffle merge uses a type of deletion vector mechanism.\nThe file in question is a temporary file created during a merge operation. Typically, this file is deleted once the merge operation completes. If the merge operation fails, this file may remain in place.\nSolution\nYou do not have to do anything. 
The file is automatically deleted on the next\nVACUUM\nrun." +} \ No newline at end of file diff --git a/scraped_kb_articles/you-get-an-insufficient-privileges-on-__databricks_internal-catalog-error-when-attempting-to-query-dlt-pipeline-views.json b/scraped_kb_articles/you-get-an-insufficient-privileges-on-__databricks_internal-catalog-error-when-attempting-to-query-dlt-pipeline-views.json new file mode 100644 index 0000000000000000000000000000000000000000..145523abf468d700f17a9aa521e6fdc7a5e18dd8 --- /dev/null +++ b/scraped_kb_articles/you-get-an-insufficient-privileges-on-__databricks_internal-catalog-error-when-attempting-to-query-dlt-pipeline-views.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/unity-catalog/you-get-an-insufficient-privileges-on-__databricks_internal-catalog-error-when-attempting-to-query-dlt-pipeline-views", + "title": "Unknown Article Title", + "content": "Problem\nWhile attempting to display sample data or run\nSELECT *\nqueries on DLT pipeline-created views using a job or cluster in single-user access mode, you encounter an\n[INSUFFICIENT_PERMISSIONS] Insufficient privileges\nerror message which indicates you do not have\nUSE CATALOG\npermission on the\n__databricks_internal\ncatalog.\nExample error message\n[INSUFFICIENT_PERMISSIONS] Insufficient privileges: User does not have USE CATALOG on Catalog '__databricks_internal'. 
SQLSTATE: 42501, data: {'type':'baseError','stackFrames':['org.apache.spark.sql.AnalysisException: [INSUFFICIENT_PERMISSIONS] Insufficient privileges:\\nUser does not have USE CATALOG on Catalog '__databricks_internal'.\nCause\nThe owner of a job with a cluster using single-user access mode is referencing a materialized view created by a DLT pipeline with a different owner.\nOnly the owner of a Databricks SQL materialized view can query the materialized view from a single-user access mode cluster\n.\nOtherwise, Databricks SQL materialized views can be queried only from Databricks SQL warehouses, Delta Live Tables, and shared clusters running Databricks Runtime 11.3 and above.\nNote\nThe\n__databricks_internal\ncatalog is an internal catalog used for DLT pipeline materializations. In general, you should not access these tables directly, which is why users do not have access to it by default.\nSolution\nUse a cluster in shared access mode for your job, or make sure you own both the materialized view and the job.\nImportant\nStrictly limiting permissions for the\n__databricks_internal\ncatalog is a design decision. Access to this catalog should only be granted for debugging purposes.\nIf you need to access data in the\n__databricks_internal\ncatalog, you should request the necessary permissions from your administrator. As a best practice, only a limited number of users should have\nUSE CATALOG\npermissions on\n__databricks_internal\n. It should not be granted to a user or group by default." 
+} \ No newline at end of file diff --git a/scraped_kb_articles/you-want-to-declare-temporary-variables-inside-a-function-in-databricks-sql.json b/scraped_kb_articles/you-want-to-declare-temporary-variables-inside-a-function-in-databricks-sql.json new file mode 100644 index 0000000000000000000000000000000000000000..f8c34ad34bc8c97468239f66c17b49d678debffd --- /dev/null +++ b/scraped_kb_articles/you-want-to-declare-temporary-variables-inside-a-function-in-databricks-sql.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/you-want-to-declare-temporary-variables-inside-a-function-in-databricks-sql", + "title": "Unknown Article Title", + "content": "Problem\nYou want to declare temporary variables inside a function in Databricks SQL, but can’t seem to do it.\nCause\nDatabricks SQL does not directly support this functionality.\nSolution\nYou must handle your logic using the function parameters and CTEs (Common Table Expressions) without traditional variable declarations.\nExample code\nIn this code, we are creating a function\n“my_generic_function”\nthat accepts a parameter and returns a table with a computed result. We use CTEs to simulate variable declarations.\nFunction Creation: The function\nmy_generic_function\naccepts a single integer parameter (\nparam_input\n) and returns a table with one column (\ncomputed_result\n).\nSimulated Variable: Since Databricks SQL doesn't support traditional variable declarations, we simulate a variable using a CTE named\nsimulated_variable\n.\nFinal Result Calculation: The second CTE (\nfinal_result\n) uses both the simulated variable and the function parameter to compute the result. 
In this example, it multiplies them together.\nBefore running this code in a notebook, replace\n\nwith the value you want as the variable’s value.\nCREATE FUNCTION my_generic_function(param_input INT)\r\nRETURNS TABLE(computed_result FLOAT)\r\nRETURN\r\n  WITH simulated_variable AS (\r\n    SELECT AS variable_value\r\n  ),\r\n  final_result AS (\r\n    SELECT variable_value * param_input AS computed_result\r\n    FROM simulated_variable\r\n  )\r\nSELECT * FROM final_result;\nThe final result uses the function parameter and the simulated variable to compute the result. Here,\nvariable_value\nis the variable declared inside a function in Databricks SQL." +} \ No newline at end of file diff --git a/scraped_kb_articles/zorder-results-in-hilbert-indexing-can-only-be-used-on-9-or-fewer-columns-error.json b/scraped_kb_articles/zorder-results-in-hilbert-indexing-can-only-be-used-on-9-or-fewer-columns-error.json new file mode 100644 index 0000000000000000000000000000000000000000..900cd27a8f1b7baada314b2bf3098c0af36f0a6b --- /dev/null +++ b/scraped_kb_articles/zorder-results-in-hilbert-indexing-can-only-be-used-on-9-or-fewer-columns-error.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/dbsql/zorder-results-in-hilbert-indexing-can-only-be-used-on-9-or-fewer-columns-error", + "title": "Unknown Article Title", + "content": "Problem\nYou are running an\nOPTIMIZE ZORDER BY\n(\nAWS\n|\nAzure\n|\nGCP\n) command in\nDatabricks SQL\n(\nAWS\n|\nAzure\n|\nGCP\n) when you get an Apache Spark exception error:\nHilbert indexing can only be used on 9 or fewer columns\n.\nError in SQL statement: ExecutionException: org.apache.spark.SparkException: Hilbert indexing can only be used on 9 or fewer columns\nCause\nThe\nOPTIMIZE ZORDER BY\ncommand has a hard limit of nine columns. This is by design.\nSolution\nYou must reduce the number of columns to nine or fewer. The best practice is to use\nZORDER\non a maximum of three columns. 
When you use\nZORDER\non four or more columns, the effectiveness is reduced with each additional column used.\nUse\nZORDER\non the most commonly used query predicate columns (the columns in the query \"where\" clause). Databricks recommends that you use\nZORDER\non high cardinality columns.\nInfo\nDelta Lake on Databricks collects statistics on the first 32 columns defined in your table schema." +} \ No newline at end of file diff --git a/scraped_kb_articles/zordering-ineffective-column-stats.json b/scraped_kb_articles/zordering-ineffective-column-stats.json new file mode 100644 index 0000000000000000000000000000000000000000..60b2c72e1a7f13cf8b6e24e374881dc223340db0 --- /dev/null +++ b/scraped_kb_articles/zordering-ineffective-column-stats.json @@ -0,0 +1,5 @@ +{ + "url": "https://kb.databricks.com/en_US/delta/zordering-ineffective-column-stats", + "title": "Unknown Article Title", + "content": "Problem\nYou are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns.\nAnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns.\nInfo\nPlease review Z-Ordering (multi-dimensional clustering) (\nAWS\n|\nAzure\n|\nGCP\n) for more information on data skipping and z-ordering.\nCause\nDelta Lake collects statistics on the first 32 columns defined in your table schema. 
If the columns you are attempting to Z-Order are not in the first 32 columns, no statistics are collected for those columns.\nSolution\nReorder the columns in your table, so the columns you are attempting to Z-Order are in the first 32 columns in your table.\nYou can use an\nALTER TABLE\nstatement to reorder the columns.\n%sql\r\n\r\nALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]\nFor example, this statement brings the column with\n\nto the first column in the table.\n%sql\r\n\r\nALTER TABLE CHANGE COLUMN FIRST\nRecompute the statistics after you have reordered the columns in the table.\n%scala\r\n\r\nimport com.databricks.sql.transaction.tahoe._\r\nimport org.apache.spark.sql.catalyst.TableIdentifier\r\nimport com.databricks.sql.transaction.tahoe.stats.StatisticsCollection\r\n\r\nval tableName = \"\"\r\nval deltaLog = DeltaLog.forTable(spark, TableIdentifier(tableName))\r\n\r\nStatisticsCollection.recompute(spark, deltaLog)\nRerun the Z-Order on the table and it should complete successfully." +} \ No newline at end of file
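The 32-column statistics rule described in the last article can be checked before running OPTIMIZE. A minimal plain-Python sketch (the helper name is ours; it only models column positions, it does not inspect an actual Delta table):

```python
def zorder_cols_without_stats(schema_cols, zorder_cols, stats_limit=32):
    """Return the Z-Order columns that fall outside the first `stats_limit`
    schema columns and therefore have no collected statistics.

    Illustrative model of the Delta Lake behavior described above.
    """
    with_stats = set(schema_cols[:stats_limit])
    return [c for c in zorder_cols if c not in with_stats]

# 40-column schema: columns at positions 32..39 have no statistics collected.
schema = [f"col{i}" for i in range(40)]
print(zorder_cols_without_stats(schema, ["col1", "col35"]))
```

An empty result means every requested Z-Order column already has statistics, so the `AnalysisException` above should not occur for position reasons.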