
Databricks log collection with Azure Monitor at desktop scale
Databricks is an optimized data analysis platform based on Apache Spark. Databricks platform monitoring is critical to ensuring data quality, job performance, and security issues by limiting access to production workspaces.
Spark application statistics, logs, and events produced by a Databricks workspace can be customized, sent, and centralized to various monitoring platforms, including Azure Monitor logs. This tool, previously called Log Analytics by Microsoft, is an Azure cloud service integrated into Azure Monitor that collects and stores logs from cloud and on-premises environments. It provides a means to query logs from data collected using a read-only query language called “Kusto”, to build dashboards for “Workbooks” and set alerts on detected patterns.
This article focuses on automating the export of Databricks logs to a Log Analytics workspace by using Spark monitoring study-scale library.
Overview of Databricks log shipping
This section is an overview of the architecture. More detailed information and associated source code can be found further down in the article.
Spark monitoring is a Microsoft Open Source project for exporting Databricks logs at the cluster level. Once downloaded, the library is built locally using Docker or Maven according to the Databricks Runtime version of the cluster to configure (Spark and Scala versions). The build of the library generates two jar files:
spark-listeners_${spark-version}_${scala_version}-${version}
: collects data from a running cluster;spark-listeners-loganalytics_${spark-version}_${scala_version}-${version}
: extendsspark-listeners
by collecting data, connecting to a Log Analytics workspace, analyzing and sending logs via the Data Collector API
In the documentation, when the jars are built, they are put on DBFS. An init script spark-monitoring.sh
edited locally with the workspace and cluster configurations and added manually via the Databricks cluster-level interface.
At cluster launch, logs are streamed in JSON format to the Log Analytics Data Collector API and stored in 3 different tables, one for each type of log sent:
- SparkMetric_CL: Execution metrics for Spark applications (memory usage, number of jobs, step tasks submitted/completed/running);
- SparkListenerEvent_CL: All events listened to by the SparkListener during the execution of the Spark application (jobs, stages and tasks start/end);
- SparkLoggingEvent_CL: Logs from log4j appender.
Some configurations allow automating the setup of log shipping at the workspace level by configuring all clusters in a given workspace. That means downloading the project, building it with Docker or Maven, editing spark-monitoring.sh
the script and the cluster environment variables. After all configurations are done, the Databricks workspace is configured by running the PowerShell script. It is based on 3 bash scripts:
spark-monitoring-vars.sh
: define the workspace environment variables;spark-monitoring.sh
: sends logs in streaming to Log Analytics;spark-monitoring-global-init.sh
: this script runs at workspace scalespark-monitoring-vars.sh
sincespark-monitoring.sh
.
The PowerShell script dbx-monitoring-deploy.ps1
runs locally and distributes configurations at the desktop level. It fills up spark-monitoring-vars.sh
with workspace variables, copying scripts and jars to DBFS and posting global init scripts to Databricks.
Configuration of a workspace
1. Build the jar files
Clone the repository Spark monitoring and locally build the jar files using Docker or Maven in the Databricks runtime versions of all clusters that need to be configured in the workspace according to the documentation.
With Docker:
At the root of spark-monitoring
folder, run the build command in the desired Spark and Scala versions. In this example, the library is built for Scala 2.12 and Spark 3.0.1.
docker run -it --rm -v pwd:/spark-monitoring -v "$HOME/.m2":/root/.m2 -w /spark-monitoring/src maven:3.6.3-jdk-8 mvn install -P "scala-2.12_spark-3.0.1"
Cans are built in spark-monitoring/src/target
folder. The spark-monitoring.sh
lies inside spark-monitoring/src/spark-listeners/scripts
folder.
All these steps are explained in the chapter Build the Azure Databricks monitoring library from Microsoft's Patterns and Practices GitHub repository.
2. Setting Log Analytics environment variables
The Log Analytics workspace ID and key are stored in Azure Key Vault secrets and referenced in the environment variables of all configured clusters. Azure Databricks accesses the key vault through the Databricks workspace Secret Scope.
After creating the secrets for the Log Analytics workspace ID and key, configure each cluster manually by referencing the secrets according to the instructions on how to set Azure Key Vault-backed Secret Scope.
LOG_ANALYTICS_WORKSPACE_KEY={{secrets/secret-scope-name/pw-log-analytics}}
LOG_ANALYTICS_WORKSPACE_ID={{secrets/secret-scope-name/id-log-analytics}}
3. Adding scripts for spark-monitoring-global-init.sh and spark-monitoring-vars.sh
Create an jars
folder, upload all jars and configuration files respecting the following file tree:
-
spark-monitoring-global-init.sh
: This script is started at the launch of each cluster in the workspace.#!/bin/bash STAGE_DIR=/dbfs/databricks/spark-monitoring VARS_SCRIPT=$STAGE_DIR/spark-monitoring-vars.sh MONITORING_SCRIPT=$STAGE_DIR/spark-monitoring.sh if [ -d "$STAGE_DIR" -a -f "$VARS_SCRIPT" -a -f "$MONITORING_SCRIPT" ]; then /bin/bash $VARS_SCRIPT; /bin/bash $MONITORING_SCRIPT; else echo "Directory $STAGE_DIR does not exist or one of the scripts needed is missing" fi
-
spark-monitoring-vars.sh
: This script is a template for all environment variables needed at cluster and workspace level.#!/bin/bash DB_HOME=/databricks SPARK_HOME=$DB_HOME/spark SPARK_CONF_DIR=$SPARK_HOME/conf tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF # Id of Azure subscription export AZ_SUBSCRIPTION_ID="$AZ_SUBSCRIPTION_ID" # Resource group name of workspace export AZ_RSRC_GRP_NAME="$AZ_RSRC_GRP_NAME" export AZ_RSRC_PROV_NAMESPACE=Microsoft.Databricks export AZ_RSRC_TYPE=workspaces # Name of Databricks workspace export AZ_RSRC_NAME="$AZ_RSRC_NAME" EOF
4. Edit and add spark-monitoring.sh
Copy spark-monitoring.sh
from the cloned project, add it to the file tree and edit environment variables like the following:
DB_HOME=/databricks
SPARK_HOME=$DB_HOME/spark
SPARK_CONF_DIR=$SPARK_HOME/conf
tee -a "$SPARK_CONF_DIR/spark-env.sh" << EOF
# Export cluster id and name from environment variables
export DB_CLUSTER_ID=$DB_CLUSTER_ID
export DB_CLUSTER_NAME=$DB_CLUSTER_NAME
EOF
Given the large storage costs associated with a Log Analytics workspace, in conjunction with Spark metrics, apply filters based on REGEX expressions to preserve only the most relevant log information. This event filtering documentation gives you the various variables to set.
5. Edit, add and run the PowerShell script
The script dbx-monitoring-deploy.ps1
used to configure the export of cluster logs from a Databricks workspace to Log Analytics.
It performs the following actions:
- Filling
spark-monitoring-vars.sh
with correct values for workspace. - Uploads
spark-monitoring-vars.sh
,spark-monitoring.sh
and all jar files on the DBFS workspace. - Posts via the Databricks API content in the global init script.
It assumes that there are 3 different Azure subscriptions (DEV/PREPROD/PROD) to separate development, testing and production phases of a continuous integration. A pre-production subscription is used for integration testing and enterprise acceptance testing before going into production.
Edit this section according to your subscriptions.
param(
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]$p,
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]$e,
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]$n,
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]$rg,
[Parameter(Mandatory=$true,ValueFromPipeline=$true)]$w
)
$armFolder = $p.TrimEnd("/","\")
$deploymentName = $n.ToLower()
$varsTemplatePath = "$armFolder/spark-monitoring-vars.sh"
if ($e -like "dev")
{
$AZ_SUBSCRIPTION_ID = ""
}
elseif ($e -like 'prod') {
$AZ_SUBSCRIPTION_ID = ""
}
elseif ($e -like 'preprod') {
$AZ_SUBSCRIPTION_ID = ""
}
else{
Write-Output "no environment provided - exiting"
Exit-PSSession
}
$AZ_RSRC_GRP_NAME = $rg
$AZ_RSRC_NAME = $w
$environment = $e.ToLower()
$parametersPath = "$armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh"
$template = Get-Content "$varsTemplatePath" -Raw
$filledTemplate = Invoke-Expression "@`"`r`n$template`r`n`"@"
mkdir -p $armFolder/$environment/$deploymentName
Out-File -FilePath $parametersPath -InputObject $filledTemplate
try {
$context = get-azContext
if(!$context)
{
Write-Output "No context, please connect !"
$Credential = Get-Credential
Connect-AzAccount -Credential $Credential -ErrorAction Stop
}
if ($environment -like "dev")
{
set-azcontext "AD-DEV01" -ErrorAction Stop
}
elseif ($environment -like 'prod') {
set-azcontext "AD-PROD01" -ErrorAction Stop
}
elseif ($environment -like 'preprod') {
set-azcontext "AD-PREPROD01" -ErrorAction Stop
}
else{
Write-Output "no context found for provided environment- exiting"
Exit
}
}
catch{
Write-Output "error setting context - exiting"
Exit
}
$mydbx=Get-AzDatabricksWorkspace -ResourceGroupName $AZ_RSRC_GRP_NAME
$hostVar = "https://" + $mydbx.Url
$myToken = Get-AzAccessToken -Resource "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
$env:DATABRICKS_AAD_TOKEN=$myToken.Token
databricks configure --aad-token --host $hostVar
databricks fs mkdirs dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/spark-monitoring.sh dbfs:/databricks/spark-monitoring
databricks fs cp --overwrite $armFolder/$environment/$deploymentName/spark-monitoring-vars-$environment-$deploymentName.sh dbfs:/databricks/spark-monitoring/spark-monitoring-vars.sh
databricks fs cp --recursive --overwrite $armFolder/jars dbfs:/databricks/spark-monitoring
$inputfile = "$armFolder/spark-monitoring-global-init.sh"
$fc = get-content $inputfile -Encoding UTF8 -Raw
$By = [System.Text.Encoding]::UTF8.GetBytes($fc)
$etext = [System.Convert]::ToBase64String($By, 'InsertLineBreaks')
$Body = @{
name = "monitoring"
script = "$etext"
position = 1
enabled = "true"
}
$JsonBody = $Body | ConvertTo-Json
$Uri = "https://" + $mydbx.Url + "/api/2.0/global-init-scripts"
$Header = @{Authorization = "Bearer $env:DATABRICKS_AAD_TOKEN"}
Invoke-RestMethod -Method Post -Uri $Uri -Headers $Header -Body $JsonBody
Enrich and start the script with these parameters:
Parameter | Description |
---|---|
page | The road to the script |
e | Environment (DEV, PREPROD, PROD) |
n | Implementation name |
rg | Workspace resource group |
w | The name of the workspace |
Call the script like this:
pwsh dbx-monitoring-deploy.ps1 -p /home/Documents/pwsh-spark-monitoring/pwsh-deploy-dbx-spark-monitoring -e DEV -n deploy_log_analytics_wksp_sales -rg rg-dev-datalake -w dbx-dev-datalake-sales
Thanks to this script, you can easily deploy the Spark monitoring library to all your Databricks workspaces.
The logs sent natively allow monitoring of cluster health, job execution and error reporting from laptops. Another way to monitor daily data processing is to perform custom logging using the log4j appender. This way, you can add steps to implement data quality validation across ingested and cleansed data and custom tests with a predefined list of expectations to validate data against.
We might consider using custom logs to log bad records, apply controls and restrictions to the data, and then send quality metrics to Log Analytics for reporting and alerting. To do so, you can build your own data quality library or use existing tools such as Apache Griffin or Amazon Deeque.
#Databricks #log #collection #Azure #Monitor #desktop #scale
Source link