Creating a data warehouse is a significant step in leveraging your organisation's data for decision-making and analytics. However, ensuring that your data warehouse security is robust is equally crucial. This blog will walk you through the best practices for protecting your data warehouse effectively and preventing data breaches.
The Importance of Data Warehouse Security
You’re creating a data warehouse. Awesome! You’ve taken the first step to organising how your business consumes and creates data. You’ve made your first few tables, you’ve connected them to dashboards, and some folks are starting to see the value when suddenly:
- The Information Commissioner comes knocking, asking how you handle PII.
- Your CEO said that his buddies at golf were boasting about how their systems caught an employee trying to scrape customer credit card information.
- A table has been deleted for some reason, and now Marketing can’t see their attribution data. CEO wants the person fired, but you don’t know who did it.
- There’s been a zero-day vulnerability detected in your database system, and now all of your customer data is floating around the dark web.
So how do you deal with these challenges? Follow the steps outlined below to mitigate data warehouse threats and ensure the safety of your data.
Step 1: Choosing the Right Data Warehouse Technology
Selecting the appropriate data warehouse technology is a critical decision. You have two main choices: on-premises or in the cloud. Let’s consider the security implications of both.
What to look for
There are some criteria you should consider when looking at your technology stack with data storage security in mind. Every company’s decision matrix will look a little different, but the three points below are a great starting point.
Scalability: As data grows, your chosen technology should effortlessly scale without compromising on performance. Slow performance pushes users towards workarounds, increasing the risk of shadow IT. Not good.
Integrated Security Features: Look for native security protocols, regular updates, and active community support to stay ahead of potential threats. The less work you create for your security team (or IT team), the better.
Compatibility: Ensure it plays well with other tools in your stack, reducing the chances of vulnerabilities emerging from disjointed systems. If you’re already running security software, check how it works with data warehousing platforms; it’s also worth checking how well-integrated the tech is with potential future providers.
So, here are your choices of data storage systems:
Cloud-based Data Warehouse
When utilising cloud-based data warehouses, data frequently travels across the internet, making strong encryption during transit essential. The responsibility of physical security falls on the cloud provider, and while they usually have rigorous security measures in place, there's an inherent trust in a third-party entity.
The advantage of cloud systems often lies in their advanced Identity and Access Management (IAM) systems. These systems offer granular controls and integrate seamlessly with other cloud services, but they can also be prone to misconfigurations, which might expose data.
One unique aspect of cloud environments is their multi-tenancy nature, which necessitates robust isolation measures to prevent data leakage between different tenants. Regulatory considerations might be more complex in the cloud, especially if data is stored in different regions or countries. Thankfully, most cloud providers offer integrated backup and disaster recovery solutions, typically including geographically redundant storage.
Another benefit of the cloud is the integrated monitoring and logging solutions, which provide real-time insights into system activity. However, this centralised solution means a dependence on the cloud vendor for certain security updates and best practices.
Cost structures in the cloud are variable, and while base costs might seem lower, implementing stringent security measures to protect your data can increase expenses. Finally, decommissioning end-of-life hardware securely is the responsibility of the cloud provider, reducing organisational overhead.
On-premises Data Warehouse
In on-premises settings, data primarily remains within the confines of the organisation's internal network. This localisation often reduces the risks associated with data transit over the broader internet. Here, the organisation has direct control over physical security measures, which can be reassuring, though they might not have the vast resources and protocols that major cloud providers can muster.
Access control in on-premises systems might be more straightforward and familiar to the IT teams, but they might not offer the same granularity or integrations as their cloud counterparts. Typically, on-premises solutions are single-tenant, which minimises the risks that come with shared resources in multi-tenant systems.
From a regulatory standpoint, having an on-premises system can simplify compliance considerations since data storage and handling remain local. However, the responsibility for backup and disaster recovery falls squarely on the organisation, which might necessitate additional infrastructure and planning. Monitoring system activity might require third-party tools and integrations, which can be both a cost and a technical challenge.
While there's more control over when and how system updates are implemented, the organisation must be proactive in updating their systems to mitigate vulnerabilities. Costs tend to be more fixed and predictable, but there could be significant upfront expenses for infrastructure. Lastly, the organisation must ensure that any end-of-life hardware is securely wiped or destroyed, preventing potential data breaches.
Step 2: Identity and Access Management (IAM)
Effective data warehouse security best practices begin with robust IAM. At its heart, IAM is about ensuring the right person has the right access. It sounds simple, but mismanagement can lead to catastrophic outcomes.
The dangers of bad IAM management
Improper setup of IAM can have multifaceted repercussions.
Firstly, it can lead to unauthorised data access, resulting in potential data breaches or unintended modifications. This lax oversight can also inadvertently disrupt critical services or, worse, permit elevated privilege attacks where attackers exploit system vulnerabilities.
Financial implications are another concern, especially in cloud environments; attackers with excessive permissions might inflate resource usage, leading to unexpected costs. Companies also risk breaching industry-specific regulations, which could entail hefty fines.
The potential loss of data, either through malicious intent or accidental deletion, poses another threat. The reputation of an organisation can be significantly damaged if a breach becomes public, eroding customer trust.
Additionally, a misconfigured IAM complicates the process of tracking user activities, creating challenges for audits and accountability. Every unnecessary permission granted only enlarges the attack surface, making systems more vulnerable. On the other end of the spectrum, being overly restrictive can hinder genuine users, leading to operational inefficiencies.
In a nutshell, a well-configured IAM is key for both robust security and smooth system operations, emphasising the need for regular audits and meticulous oversight.
So, what can you do to lock down your IAM?
To fortify the security of Identity and Access Management, organisations should diligently adopt a blend of strategic principles and actionable measures.
Role-based access control
At the core of IAM security lies the Principle of Least Privilege (PoLP), which advocates for granting users and services the bare minimum permissions required for their roles, progressively scaling access based on necessity. This dynamic access model underscores the importance of regularly reviewing and recalibrating user permissions to ensure relevancy and precision.
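The principle of least privilege can be sketched in a few lines. Below is a minimal, illustrative role-based access check; the role names and permission strings are hypothetical, not taken from any real IAM system:

```python
# Minimal sketch of role-based access control under least privilege.
# Roles and permission names here are illustrative only.
ROLE_PERMISSIONS = {
    "analyst": {"tables.read"},
    "engineer": {"tables.read", "tables.write"},
    "admin": {"tables.read", "tables.write", "tables.delete", "iam.manage"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Return True only if the role explicitly grants the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

# An analyst starts with the bare minimum: reads, nothing else.
assert is_allowed("analyst", "tables.read")
assert not is_allowed("analyst", "tables.delete")
```

The key design point is the default: an unknown role or unlisted permission yields a denial, so access has to be granted deliberately rather than removed reactively.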
Enhancing basic login protocols with Multi-Factor Authentication (MFA) significantly mitigates unauthorised access risks.
For applications, assigning IAM roles is preferable to hardcoding credentials, providing an adaptable framework for temporary resource access. Continual surveillance of IAM activities, facilitated by automated monitoring tools, offers invaluable insights into user behaviour and potential anomalies like unexpected privilege escalations or unusual login patterns.
Periodic rotation of access keys and other credentials is crucial to counteract the fallout from potential leaks. A holistic IAM strategy promotes a segregation of duties, preventing the accumulation of unchecked power in the hands of a single individual.
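Key rotation is easy to automate as a check. Here is a small sketch, assuming you can obtain each credential's creation timestamp from your provider's API or an inventory file (the 90-day window and key IDs are illustrative):

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # illustrative rotation policy

def keys_due_for_rotation(keys: dict) -> list:
    """Return the IDs of access keys older than the rotation window."""
    now = datetime.now(timezone.utc)
    return [key_id for key_id, created in keys.items()
            if now - created > MAX_KEY_AGE]

# Hypothetical service-account keys and their creation times.
keys = {
    "svc-etl": datetime.now(timezone.utc) - timedelta(days=200),
    "svc-dashboards": datetime.now(timezone.utc) - timedelta(days=10),
}
print(keys_due_for_rotation(keys))  # → ['svc-etl']
```

Run something like this on a schedule and feed the output into a ticket or alert, rather than relying on someone remembering to rotate.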
Precise IAM policy formulations that avoid generic permissions, combined with an organised structure that groups users based on roles or functions, streamline management and enhance oversight. The use of root accounts, given their overarching access, should be stringently limited and protected with robust authentication measures.
Central to maintaining a resilient IAM framework and protecting your data is the continual education and training of staff, ensuring they're up to date with evolving best practices. Comprehensive logging and documentation of IAM activities, combined with the proactive termination of dormant accounts, rounds out a thorough approach to IAM security.
By implementing these measures, organisations can navigate the digital landscape with heightened confidence in their security infrastructure.
Step 3: Data Encryption
Data encryption is a fundamental security measure for your data warehouse. The technologies you use (see step 1) should support encryption as standard. There are two main types of encryption, depending on what your data is doing.
- If your data is ‘at rest’, you must encrypt it in its data store. Most data warehouse platforms do this by default.
- If your data is ‘in transit’, you have to ensure it is encrypted as it travels between your data warehouse and its relevant data sources. This depends on the implementation of your data pipelines.
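If your pipelines open their own network connections, you can enforce a modern TLS floor in code rather than hoping for it. A minimal sketch using Python's standard `ssl` module (managed connectors and client libraries typically handle this for you, so treat this as illustrative):

```python
import ssl

# Refuse to speak anything below TLS 1.2 when moving data
# between sources and the warehouse.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Certificate and hostname verification are on by default -- leave them on.
assert context.check_hostname
assert context.verify_mode == ssl.CERT_REQUIRED
```

Any socket wrapped with this context will fail the handshake against a server that only offers older protocols, which is exactly the failure mode you want: loud, early, and before any data moves.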
There are two main reasons to encrypt your data: compliance, and a layer of protection if a data breach occurs. The ICO says you need to encrypt your data, national security guidance tells you to encrypt your data, Cyber Essentials tells you to encrypt your data. You get the point.
One piece of advice: look for a service that automatically manages your encryption keys and certificates. I’d highly recommend you do. Managing encryption keys is a whole job in itself. If you could do that job, you wouldn’t be reading this article.
Google's tools, for instance, automatically encrypt data at rest and in transit, offering peace of mind to users.
Step 4: Data Loss Prevention
Data will get lost at some point. Someone will accidentally delete a table or dataset, or a malicious actor will decide to dump your entire database for whatever reason. It happens.
The best thing you can do as a data warehouse administrator is try to prevent this from happening. You can use IAM, and you can use encryption, but there are also configurable automated solutions that can help you with data loss prevention (DLP).
Imagine DLP as a ‘security guard’ for your data. Here’s what the security guard does:
- Watchful Eye: DLP constantly monitors what you and your colleagues are doing with the company's digital folders (the data). If someone tries to send an important document to an outside email or copy it to a USB drive, DLP sees it.
- Setting Boundaries: Your company sets rules on what's allowed and what's not. For example, you might be allowed to view certain documents but not send them outside the company. DLP enforces these rules. If you accidentally try to email a confidential file to your personal email, DLP would block it.
- Smart Detective: DLP doesn't just look at file names; it peeks inside to understand the content. So, even if you rename a file from "CompanySecrets" to "HolidayPhotos", the DLP will know the true nature of the contents and act accordingly.
- Protector of Mistakes: We all make mistakes. Maybe you intended to send a document to "John from Accounting", but you accidentally chose "John from High School". DLP catches these innocent mistakes by checking if "John from High School" should really be receiving that document.
- Reporter: DLP also keeps a record. So if it stops someone from sending out sensitive information or spots unusual activity, it makes a note. This helps the company understand where potential risks are and train employees better.
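The "smart detective" behaviour above usually boils down to content inspection. Here is a toy sketch of one such check: spotting credit-card-like numbers and filtering them through the Luhn checksum. Real DLP engines use far richer detectors; the pattern and test strings are illustrative:

```python
import re

# Matches 13-16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to cut down false positives on card-like numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return substrings that look like valid card numbers."""
    return [m.group() for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group())]

print(find_card_numbers("order ref 12345, card 4111 1111 1111 1111"))
# → ['4111 1111 1111 1111']
```

Note how the Luhn check discards digit runs that merely look like card numbers; that trade-off between pattern breadth and checksum strictness is exactly where the false positives and false negatives discussed below come from.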
False positives and false negatives
Imagine our DLP security guard is trained to spot certain suspicious behaviours: like someone trying to climb over a fence instead of using the gate. However, sometimes an employee might be hopping over just to retrieve a ball that went over. The security guard might raise an alarm thinking it's a breach, but in reality, it's a harmless act. This is akin to a "false positive" in DLP, where an action is flagged as suspicious or violating a rule, even if the intent is innocent or the activity is legitimate.
On the other hand, there might be situations where the security guard doesn't catch a genuinely suspicious activity because it doesn't fit the usual patterns he's trained to spot. Maybe someone has disguised themselves as a janitor and is quietly sneaking past. This is akin to a "false negative" in DLP, where a real threat or unauthorised action goes unnoticed.
As Google Cloud partners, we have outlined the basics on how to set up DLP on Google Cloud:
- Enable the DLP API
- Make sure the right users can access it using IAM
- Start making your DLP templates or use some of the pre-made ones. A template can be something like “I’d like to redact any sort of data that resembles a national insurance number”. You can have as many of those as you like.
- Set up your automatic triggers, i.e. when DLP scans run. This could be weekly, daily, etc.
At this point, you’re basically good to go. We recommend you start small and add rules as and when needed. You can also set up more custom alerts, and run manual jobs when you think there’s been a breach, or based on internal decisions. You can automate almost any action in GCP triggered by a DLP event; for example, you could alert a customer when you detect that credit card information has been entered into your application’s portal.
Step 5: Metadata and Governance
At its core, metadata is "data about data". Imagine walking into a vast library; without a catalogue or some system to understand where books are and what topics they cover, you'd be lost. Metadata is our catalogue for the data warehouse. It tells us details about our data - where it came from, when it was modified, who has accessed it, and more.
In the context of Google Cloud, tools like Data Catalog serve as our metadata management solution. They let us discover, understand, and manage all our data assets.
But why is metadata essential for security?
Access Control: Knowing who accessed what data and when helps us monitor for unusual patterns. For instance, if someone who never accesses financial data suddenly downloads massive chunks of it, our system will flag it.
Data Lineage: Metadata lets us trace back where our data came from. If we ever find corrupted or suspicious data, we can use this lineage to determine its origin and rectify issues at the source.
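The access-pattern monitoring described above can be prototyped from audit metadata alone. A minimal sketch, assuming you can export access events and maintain a per-user baseline of datasets they normally read (all names here are hypothetical):

```python
def flag_unusual_access(access_log: list, baseline: dict) -> list:
    """Flag events where a user reads a dataset outside their baseline."""
    return [event for event in access_log
            if event["dataset"] not in baseline.get(event["user"], set())]

# Hypothetical baseline built from historical audit logs.
baseline = {"alice": {"marketing"}, "bob": {"finance"}}

events = [
    {"user": "alice", "dataset": "marketing"},  # normal for alice
    {"user": "alice", "dataset": "finance"},    # new for alice -- flagged
]
print(flag_unusual_access(events, baseline))
```

In practice you would feed flagged events into an alerting channel and periodically refresh the baseline, so legitimate new access patterns stop firing once they have been reviewed.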
Data governance refers to the set of rules, policies, and processes we put in place to manage and protect our data. Think of governance as the library's rules - how books are borrowed, who can access the rare books section, how often the inventory is checked, etc.
If you’re building your own library, you have to create your own systems for this. If you’re using someone else's library to store your books (the cloud), there are some features that come with it.
Different cloud providers offer different governance tools. Google Cloud offers several, but the key ones are:
Column-level Security: BigQuery allows organisations to set fine-grained access controls at the column level, ensuring that only authorised users can view sensitive columns.
Row-level Security: Through Authorised Views, BigQuery supports row-level security, enabling tailored data access for different user groups based on specific conditions.
Data Classification and DLP Integration: Data can be classified based on its sensitivity, reducing the risk through techniques that anonymise or de-identify and redact sensitive data.
Data Retention and Automatic Deletion: Users can set table expiration dates to ensure data is automatically deleted after a specified time, helping with data lifecycle management.
Geographic Data Management: BigQuery's multi-region setup allows organisations to specify where their data resides, assisting with data residency requirements that might be mandated by regional laws or company policies.
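As an illustration of the row-level security idea above: an authorised view is just a SQL view that users are granted access to instead of the underlying table. A small sketch that generates such a view's DDL; the dataset, table, and column names are hypothetical, and the filter values are assumed to come from trusted configuration, not user input:

```python
def authorised_view_sql(view: str, source: str,
                        filter_column: str, filter_value: str) -> str:
    """Build DDL for a filtered view exposing only one group's rows."""
    # Values are interpolated directly because they come from trusted
    # config in this sketch; parameterise anything user-supplied.
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        f"SELECT * FROM `{source}`\n"
        f"WHERE {filter_column} = '{filter_value}'"
    )

sql = authorised_view_sql(
    "reporting.uk_sales",  # hypothetical dataset.view
    "warehouse.sales",     # hypothetical source table
    "region", "UK",
)
print(sql)
```

Grant the reporting team access to `reporting.uk_sales` only, and they see UK rows without ever holding permissions on `warehouse.sales` itself.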
In essence, metadata gives us a deep understanding of our data, while governance ensures we have the right processes and controls to manage and protect it. Both are foundational for establishing best practices for securing your data warehouse and preventing data breaches.
Securing a data warehouse is a journey, not a destination. As technology and threats evolve, so should your strategy. By following the steps outlined above, you lay down a robust foundation for data security.
If this seems daunting, don't worry! As a Google Cloud Partner, specialising in Looker and BigQuery, we have the expertise to guide and assist you in implementing these steps using Google's state-of-the-art tools. Let us help you secure your data warehouse seamlessly. Contact us today.
Want to know more about data in GCP? Here are a few more useful articles: