Security practices and design principles for implementing a data lakehouse solution in Azure Synapse

Background

"The Lake House" by YellowstoneNPS. Licensed under CC PDM 1.0

Synapse is a versatile data platform that supports enterprise data warehousing, real-time data analytics, data pipelines, time-series data processing, machine learning, and data governance. It integrates several different technologies (e.g., SQL DW, Serverless SQL, Spark, Data Pipeline, Data Explorer, Synapse ML, Purview…) to support these capabilities. However, this also inevitably increases the complexity of the system infrastructure.

img

In this blog, I would like to share the security learnings from implementing a data lakehouse project for an international manufacturing company using Synapse. Data Lakehouse is a modern data management architecture that combines the cost-efficiency, scale, and flexibility of data lakes with the data and transaction management capabilities of data warehouses. It supports business intelligence and machine learning (ML) scenarios well across many diverse data structures and data sources. Common use cases for the solution are IoT telemetry analysis, consumer activity and behavior tracking, security log monitoring, and semi-structured data processing.

We will focus on the security design and implementation practices used in the project. In the project, we chose Synapse serverless SQL and Synapse Spark to implement the data lakehouse pattern. Following is the high-level solution design architecture.

img

Fig.1 High-level concept of the solution
Source: The best practices for organizing Synapse workspaces and lakehouse

Design Focus

We started the security design by using the Threat Modeling tool. The tool helps us communicate with project stakeholders about potential risks and define the trust boundaries in the system. Based on the threat modeling results, Identity and Access Control, Network Protection, and DevOps Security were prioritized in the project. Based on these priorities, we implemented additional security features and adjusted the infrastructure to protect the system and mitigate the key security risks identified. These priorities also map to Access Control, Asset Protection, and Innovation Security in the Cloud Adoption Framework (CAF)'s security disciplines. We will walk through the design principles and related technologies in more detail in the following sections.

Security Design Principles and Learning

Network and Asset protection Design

One of the key security assurance principles in the Cloud Adoption Framework is the Zero Trust principle. When designing security for any component or system, reduce the risk of attackers expanding access by assuming that other resources in the organization are compromised.

Based on the threat modeling discussion results, we followed the micro-segmentation deployment recommendation in Zero Trust and defined several security boundaries. VNets and Synapse data exfiltration protection are the key technologies used to implement the security boundaries and protect the system's data assets and critical components.

Considering that Synapse is a composition of several different technologies, we need to:

  • Identify essential components of Synapse and related services used in the project.

    Synapse is a very versatile data platform that can fulfill many different data processing needs. First, we need to decide which Synapse components are used in the project so we can plan how to protect them. We also need to determine which other services communicate with these components. In the data lakehouse architecture, Synapse Serverless SQL, Synapse Spark, Synapse Pipeline, Azure Data Lake, and Azure DevOps are the key components.

  • Define the "legal" communication behaviors between the components.

    For example, should the Synapse Spark engine communicate with the dedicated SQL instance directly, or should it go through a proxy such as the Synapse Data Integration pipeline or the Data Lake?

    Based on the Zero trust principle, we should block the communication if there is no business need for the interaction. For example, we should block the communication if a Synapse Spark engine directly communicates with Data Lake storage in an unknown tenant.

  • Choose the proper security solution that can enforce the defined communication behaviors.

    In Azure, several security technologies can enforce the defined service communication behaviors. For example, with Azure Data Lake storage you can control access with an IP allow-list, or you can allow selected VNets, Azure services, or specific resource instances. Each protection method provides different security guarantees and should be selected based on business needs and environmental limitations. I will describe the configuration we used in our project in the next section.

  • Add threat detection and advanced defense for critical resources.

    For critical resources, it is better to add threat detection and advanced defense. These services help identify threats and trigger alerts, so the system can notify users about a security breach.
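To make the storage access options above concrete, the following sketch builds the `networkAcls` fragment of an ARM template for a `Microsoft.Storage/storageAccounts` resource. It is a minimal illustration, not the project's actual template; the tenant ID and resource ID in the usage example are hypothetical placeholders.

```python
# Sketch: deny-by-default "networkAcls" settings for a storage account,
# expressed as the ARM template fragment for Microsoft.Storage/storageAccounts.

def build_network_acls(ip_rules=None, vnet_subnet_ids=None, resource_instance_ids=None):
    """Build a deny-by-default networkAcls block with optional allow rules."""
    return {
        "defaultAction": "Deny",  # Zero Trust: block everything by default
        "bypass": "None",
        # Option 1: allow-list of client IP addresses or ranges
        "ipRules": [{"value": ip, "action": "Allow"} for ip in (ip_rules or [])],
        # Option 2: allow traffic from selected VNet subnets
        "virtualNetworkRules": [
            {"id": subnet_id, "action": "Allow"}
            for subnet_id in (vnet_subnet_ids or [])
        ],
        # Option 3: resource instance rules allow specific service instances
        # (e.g., a Synapse workspace) regardless of their network location.
        "resourceAccessRules": [
            {"tenantId": tenant_id, "resourceId": resource_id}
            for tenant_id, resource_id in (resource_instance_ids or [])
        ],
    }

# Hypothetical example: allow only a specific Synapse workspace instance.
acls = build_network_acls(
    resource_instance_ids=[(
        "00000000-0000-0000-0000-000000000000",   # placeholder AAD tenant id
        "/subscriptions/<sub>/resourceGroups/<rg>/providers/"
        "Microsoft.Synapse/workspaces/<workspace>",
    )]
)
```

Combining a default `Deny` action with narrowly scoped allow rules mirrors the table in the next section: every client-service pair that is not explicitly required stays blocked.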

Network and Asset protection Implementation in the project

In the data lakehouse solution, we designed and controlled the service’s interaction behaviors based on business requirements to mitigate security threats. The following table shows the defined communication behaviors and security solutions used in the project.

| From (Client) | To (Service) | Behavior | Configuration | Notes |
|---|---|---|---|---|
| Internet | Azure Data Lake | Deny All | Firewall Rule - Default Deny | Default: `Deny` |
| Synapse Pipeline/Spark | Azure Data Lake | Allow (Instance) | VNet - Managed Private Endpoint (Azure Data Lake) | |
| Synapse SQL | Azure Data Lake | Allow (Instance) | Firewall Rule - Resource Instances (Synapse SQL) | Synapse SQL needs to access Azure Data Lake using Managed Identity |
| Azure Pipeline Agent | Azure Data Lake | Allow (Instance) | Firewall Rule - Selected Virtual Networks; Service Endpoint - Storage | For integration testing; bypass: `AzureServices` (firewall rule) |
| Internet | Synapse Workspace | Deny All | Firewall Rule | |
| Azure Pipeline Agent | Synapse Workspace | Allow (Instance) | VNet - Private Endpoint | Requires 3 private endpoints (Dev, Serverless SQL, and Dedicated SQL) |
| Synapse Managed VNet | Internet / Unauthorized Azure Tenant | Deny All | VNet - Synapse Data Exfiltration Protection | |
| Synapse Pipeline/Spark | Key Vault | Allow (Instance) | VNet - Managed Private Endpoint (Key Vault) | Default: `Deny` |
| Azure Pipeline Agent | Key Vault | Allow (Instance) | Firewall Rule - Selected Virtual Networks; Service Endpoint - Key Vault | bypass: `AzureServices` (firewall rule) |
| Azure Functions | Synapse Serverless SQL | Allow (Instance) | VNet - Private Endpoint (Synapse Serverless SQL) | |
| Synapse Pipeline/Spark | Azure Monitor | Allow (Instance) | VNet - Private Endpoint (Azure Monitor) | |

The diagram below shows the architecture with the network and asset protection design.
img

The above diagram includes:

  • Creating a Synapse workspace with a managed virtual network.
  • Securing data egress from the Synapse workspace with data exfiltration protection.
  • Managing the list of approved Azure AD tenants for the Synapse workspace.
  • Configuring network rules that grant storage account access only to traffic from selected virtual networks, and disabling public network access.
  • Using Managed Private Endpoints to connect the Synapse managed VNet with the Data Lake.
  • Using Resource Instance rules to securely connect Synapse SQL with the Data Lake.

For better Network and Asset protection, the following are additional security design considerations.

  • Deploy Perimeter Networks for Security Zones for Data Pipeline.

    In a data pipeline, data can be loaded from external data sources. When a data pipeline workload requires access to external data and the data landing zone, it is better to implement a perimeter network and keep it separated from the regular ETL pipeline.

  • Enable Azure Defender for all storage accounts.

    Azure Defender provides an additional layer of security intelligence that detects unusual and potentially harmful attempts to access or exploit storage accounts. Security alerts are triggered in Azure Security Center.

  • Lock the storage account to prevent malicious deletion or configuration changes.
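A resource lock can be declared directly in the deployment template. The following is a minimal sketch of the `Microsoft.Authorization/locks` ARM resource; the lock name and notes are placeholders, not taken from the original project.

```python
# Sketch: a management-lock definition (ARM resource type
# "Microsoft.Authorization/locks") scoped to a storage account.
# "CanNotDelete" blocks deletion; "ReadOnly" additionally blocks changes.

def build_lock(name: str, level: str = "CanNotDelete", notes: str = "") -> dict:
    if level not in ("CanNotDelete", "ReadOnly"):
        raise ValueError("level must be 'CanNotDelete' or 'ReadOnly'")
    return {
        "type": "Microsoft.Authorization/locks",
        "apiVersion": "2016-09-01",
        "name": name,
        "properties": {"level": level, "notes": notes},
    }

lock = build_lock("lock-datalake", notes="Protect the lakehouse storage account")
```

Note that `ReadOnly` can break services that need to list storage keys, so `CanNotDelete` is often the safer default for data lake accounts.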

Identity and Access control

There are several parts in the system. Each part requires a different Identity and Access Management (IAM) configuration. They will need to collaborate tightly to provide a streamlined user experience. Therefore, we need to plan the following parts when we implement authentication and authorization control.

img

  • Choose the identity type in different Access Control Layers

    There are four different identity solutions in the system.

    • SQL Account (SQL Server)
    • Service Principal (Azure AD)
    • Managed Identity (Azure AD)
    • User AAD Account (Azure AD)

    Also, there are four different access control layers in the system.

    • Application access layer
    • Synapse access layer
    • SQL DB access layer
    • Azure Data Lake access layer

    A key part of identity and access control is choosing the right identity solution for different user roles in each access control layer. The Well-Architected Framework security design principles suggest using native controls and driving simplicity. Therefore, we decided to use User AAD accounts in the Application, Synapse, and SQL DB access layers to leverage the native first-party IAM solutions, while using the Managed Identity of Synapse to access Azure Data Lake to simplify the authorization process.

  • Consider Least-privileged access

    The Zero Trust guiding principles suggest providing just-in-time and just-enough access to critical resources. We plan to adopt Azure AD Privileged Identity Management (PIM) and enhance the security deployment checks in the future.

  • Protect Linked Service

    Linked services define the connection information needed for Synapse to connect to external resources. It is important to secure linked service configuration and access in Synapse.
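One common way to protect a linked service is to keep the secret out of the definition entirely by referencing an Azure Key Vault secret. The sketch below shows the general shape of such a linked-service JSON definition for a Data Lake (ADLS Gen2) connector; the linked service names, secret name, and storage URL are hypothetical placeholders, so verify the properties against the connector you actually use.

```python
# Sketch: a Synapse/ADF linked-service definition whose account key is not
# stored inline but resolved at runtime from an Azure Key Vault secret.
# All names here are illustrative placeholders.

def storage_linked_service(kv_linked_service: str, secret_name: str) -> dict:
    return {
        "name": "LS_DataLake",  # placeholder linked-service name
        "properties": {
            "type": "AzureBlobFS",  # ADLS Gen2 connector type
            "typeProperties": {
                "url": "https://<account>.dfs.core.windows.net",  # placeholder
                "accountKey": {
                    # The secret lives in Key Vault, referenced indirectly
                    # through a Key Vault linked service.
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": kv_linked_service,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
            },
        },
    }

ls = storage_linked_service("LS_KeyVault", "datalake-account-key")
```

Where possible, prefer a managed identity over an account key so that no secret needs to be stored at all; the Key Vault reference pattern above applies to any connector that still requires a credential.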

DevOps security

  • Use VNet enabled self-hosted pipeline agent

    The default Microsoft-hosted Azure DevOps pipeline agent cannot support VNet communication because it uses a very wide IP range. Therefore, we implemented an Azure DevOps self-hosted agent in the VNet so the DevOps process is protected and can communicate smoothly with the rest of the system. In addition, VM scale sets are used so the agent pool can scale up and down.

img

  • Implement infrastructure security scanning & security smoke testing in CI/CD pipeline

    A static analysis tool that scans infrastructure as code (IaC) files can help detect and prevent misconfigurations that may lead to security or compliance problems. In addition, security smoke testing ensures that vital security measures are successfully enabled, preventing security risks caused by a fault in the deployment pipeline.

    • Use a static analysis tool, such as Checkov or Terrascan, to scan infrastructure as code (IaC) templates and detect and prevent misconfigurations that may lead to security or compliance problems.
    • Make sure the CD pipeline correctly handles the failure of the deployment. Any deployment failure related to security features should be treated as a critical failure. It should retry the failed action or hold the deployment.
    • Validate the security measures in the deployment pipeline by running security smoke testing. The security smoke testing, such as validating the configuration status of deployed resources or testing cases that examine critical security scenarios, can ensure that the security design is working as expected.
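As a sketch of what such a smoke test can look like, the function below inspects the properties returned for a deployed storage account (for example, the JSON from `az storage account show`) and reports any security setting that did not survive deployment. The field names follow the ARM storage schema; the checks shown are illustrative, so extend them to match your own security design.

```python
# Sketch: a post-deployment security smoke test for a storage account.
# An empty result means all checks passed; any entry is a critical failure
# that should stop or roll back the deployment.

def check_storage_security(properties: dict) -> list:
    """Return a list of failed checks (empty list means the smoke test passed)."""
    failures = []
    acls = properties.get("networkAcls", {})
    if acls.get("defaultAction") != "Deny":
        failures.append("networkAcls.defaultAction must be 'Deny'")
    if properties.get("supportsHttpsTrafficOnly") is not True:
        failures.append("HTTPS-only traffic must be enforced")
    if properties.get("minimumTlsVersion") not in ("TLS1_2", "TLS1_3"):
        failures.append("minimum TLS version must be at least 1.2")
    return failures

# A correctly hardened account yields no failures.
good = {
    "networkAcls": {"defaultAction": "Deny"},
    "supportsHttpsTrafficOnly": True,
    "minimumTlsVersion": "TLS1_2",
}
```

Running this kind of check as the last CD stage turns the "treat security failures as critical" rule above into an enforced gate rather than a convention.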

Security Score assessment and Threat Detection

To understand the security status of the system, we used Microsoft Defender for Cloud to assess the infrastructure security and detect the security issues. Microsoft Defender for Cloud is a tool for security posture management and threat protection. It can protect workloads running in Azure, hybrid, and other cloud platforms.

img

You can enable Defender for Cloud's free plan on all your current Azure subscriptions when you visit the Defender for Cloud pages in the Azure portal for the first time. I highly recommend enabling it so you can get a cloud security posture evaluation and suggestions, and it is all free, so why not ☺️. Microsoft Defender for Cloud will provide a security score and security hardening guidance for your subscription.

img

The information is quite straightforward, and the recommendations are excellent. Many of them are easy to act on.

img

If you need advanced security management and threat detection capabilities, such as suspicious activity detection and alerting, you can enable Cloud Workload Protection individually for different resources. This gives you the option to choose the most cost-effective way to protect the system.
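Enabling a workload protection plan per resource type can also be captured in infrastructure as code. Below is a minimal sketch of the subscription-scoped `Microsoft.Security/pricings` ARM resource; treat the API version and plan name as examples to verify against the current documentation for your environment.

```python
# Sketch: the ARM resource fragment ("Microsoft.Security/pricings") that turns
# on a paid Microsoft Defender for Cloud plan for one resource type at the
# subscription scope. "Free" keeps that resource type on the free tier.

def defender_plan(resource_type: str, tier: str = "Standard") -> dict:
    if tier not in ("Free", "Standard"):
        raise ValueError("tier must be 'Free' or 'Standard'")
    return {
        "type": "Microsoft.Security/pricings",
        "apiVersion": "2018-06-01",
        "name": resource_type,            # e.g., "StorageAccounts"
        "properties": {"pricingTier": tier},
    }

plan = defender_plan("StorageAccounts")
```

Declaring the plan in the template keeps the per-resource-type cost decision reviewable and repeatable across subscriptions.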

Summary

With the combination of Synapse Serverless SQL and Synapse Spark, we built a flexible, scalable, and cost-effective data lakehouse solution in the project. This article summarizes the security design principles and practices we implemented in the solution. We started from the Threat Modeling tool and the guidance in the Cloud Adoption Framework (CAF)'s security disciplines. To harden the security protection, the project team decided to focus on Network and Asset Protection, Identity and Access Control, and DevOps security. We also evaluated the security score of the system and reviewed the security suggestions provided by Microsoft Defender for Cloud. You need to choose among several different configurations when protecting the network; this article describes the design we settled on after comparing the options. Identity and access control are crucial for securing the data assets in the system, and especially for a data lakehouse solution, you need to map the different access control layers and choose the right identity solution for each. I hope these learnings will also help you implement a secure data lakehouse solution on Azure Synapse.



Reference