Data can be both an asset and a liability. As enterprises evolve, the volume and complexity of the data required to support the business increase. Virtually all organizations store sensitive data on-premises, and increasingly in the cloud, that their customers, business partners, shareholders and board expect them to protect against theft, loss and misuse. Gartner asserts that security and risk management leaders need to shift from current trends in data loss prevention and “implement a holistic data security governance strategy.”
Data loss prevention (DLP) solutions are growing in popularity as enterprises look for ways to reduce the risk of sensitive data leaking outside the company. A DLP solution relies on several core technologies that enable its engine to accurately identify the sensitive data that enterprises need to secure and take remediation action to prevent incidents. This post covers the different technologies employed by DLP solutions today.
Data loss prevention (DLP), per Gartner’s definition, is a set of technologies that perform both content inspection and contextual analysis of data: data sent via messaging applications such as email and IM, data in motion over a network, data in use on a managed endpoint device, and data at rest on on-premises file servers or in cloud applications and cloud storage. These solutions execute responses based on policies and rules defined to address the risk of inadvertent or accidental leaks, or the exposure of sensitive data outside authorized channels.
DLP technologies are broadly divided into two categories: Enterprise DLP and Integrated DLP. Enterprise DLP solutions are comprehensive, packaged as agent software for desktops and servers, physical and virtual appliances for monitoring network and email traffic, or soft appliances for data discovery. Integrated DLP, by contrast, is a more limited capability built into other products such as secure web gateways (SWGs), secure email gateways (SEGs), email encryption products, enterprise content management (ECM) platforms, data classification tools, data discovery tools and cloud access security brokers (CASBs).
Understanding the difference between content awareness and contextual analysis is essential to understanding any DLP solution in its entirety. Put simply, if content is a letter, context is the envelope. Content awareness involves capturing the envelope, peering inside it and analyzing the letter itself, while context covers external factors such as headers, size, format and so on: anything other than the content of the letter. The idea behind content awareness is that although we want to use context to gain more intelligence about the content, we don’t want to be restricted to a single context.
Once the envelope is opened and the content processed, there are multiple content analysis techniques which can be used to trigger policy violations, including:
- Rule-Based/Regular Expressions: The most common analysis technique used in DLP involves an engine analyzing content against specific rules, such as patterns for 16-digit credit card numbers or 9-digit US Social Security numbers. This technique is an excellent first-pass filter since the rules can be configured and processed quickly, although it is prone to high false positive rates without checksum validation to confirm that a matched pattern is a valid number.
- Database Fingerprinting: Also known as Exact Data Matching, this mechanism looks for exact matches against values from a database dump or a live database. Although database dumps and live database connections affect performance, this is an option for analyzing structured data from databases.
- Exact File Matching: File contents are not analyzed; instead, file hashes are matched against exact fingerprints. This approach produces few false positives, although it does not work for files with multiple similar but not identical versions.
- Partial Document Matching: Looks for complete or partial match on specific files such as multiple versions of a form that have been filled out by different users.
- Conceptual/Lexicon: Using a combination of dictionaries, rules and similar resources, these policies can alert on completely unstructured ideas that defy simple categorization. They typically need to be customized for the DLP solution in use.
- Statistical Analysis: Uses machine learning or other statistical methods, such as Bayesian analysis, to trigger policy violations on sensitive content. These methods require a large volume of data to train on, the bigger the better; otherwise they are prone to both false positives and false negatives.
- Pre-Built Categories: Ready-made categories with rules and dictionaries for common types of sensitive data, such as credit card numbers (PCI DSS) or protected health information (HIPAA).
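To make the first technique concrete: the checksum validation mentioned for card numbers is typically the Luhn algorithm. The sketch below, with a hypothetical regex and function names not taken from any particular DLP product, shows how a checksum pass filters out random 16-digit strings that a bare regex would flag:

```python
import re

# Hypothetical first-pass pattern: 16 digits with optional space/dash separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def luhn_valid(candidate: str) -> bool:
    """Return True if the digits pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit and
    # subtracting 9 whenever the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """The regex finds candidates; the checksum discards invalid patterns."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

A string like `1234 5678 9012 3456` matches the regex but fails the checksum, so it never triggers a policy, which is exactly the false-positive reduction described above.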
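Exact file matching can be sketched with standard cryptographic hashing; the registry below is a hypothetical example for illustration, not a real product API:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def is_exact_match(data: bytes, known_fingerprints: set[str]) -> bool:
    """True only if the file is byte-for-byte identical to a registered file."""
    return fingerprint(data) in known_fingerprints

# Hypothetical workflow: register sensitive files up front,
# then check outbound content against the stored digests.
registry = {fingerprint(b"2024 board deck, CONFIDENTIAL")}
```

Because a single changed byte produces a completely different digest, this approach misses near-duplicate versions of a file, which is the limitation noted in the bullet above.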
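Partial document matching is commonly implemented by hashing overlapping word windows ("shingles") of a protected document and measuring how many of them reappear elsewhere. The following is a minimal sketch under that assumption, not the method any specific vendor uses:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Hash every k-word window ('shingle') of the text."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(len(words) - k + 1)
    }

def overlap(doc: str, protected: str, k: int = 5) -> float:
    """Fraction of the protected document's shingles found in doc."""
    protected_shingles = shingles(protected, k)
    if not protected_shingles:
        return 0.0
    return len(shingles(doc, k) & protected_shingles) / len(protected_shingles)
```

A high overlap score flags a document that embeds part of a protected file, such as a filled-out copy of a registered form, even when the surrounding text differs.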
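A minimal flavor of the statistical approach is a naive Bayes classifier over word frequencies. The toy corpus below is invented purely for illustration and is far too small for real use; as the bullet warns, these methods need large volumes of training data to avoid false positives and negatives:

```python
import math
from collections import Counter

# Toy labeled corpus (hypothetical); real statistical DLP trains on
# thousands of documents per class.
SENSITIVE_DOCS = [
    "salary spreadsheet attached",
    "patient medical record number",
    "confidential merger terms",
]
BENIGN_DOCS = [
    "lunch menu for friday",
    "meeting moved to 3pm",
    "see you at the offsite",
]

def _log_likelihood(doc: str, counts: Counter, total: int, vocab: int) -> float:
    # Laplace-smoothed log-probability of the words under a bag-of-words model.
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in doc.lower().split())

def is_sensitive(doc: str) -> bool:
    # Class priors are omitted since both toy classes are the same size.
    vocab = {w for d in SENSITIVE_DOCS + BENIGN_DOCS for w in d.split()}
    s = Counter(w for d in SENSITIVE_DOCS for w in d.split())
    b = Counter(w for d in BENIGN_DOCS for w in d.split())
    return (_log_likelihood(doc, s, sum(s.values()), len(vocab))
            > _log_likelihood(doc, b, sum(b.values()), len(vocab)))
```

With only six training documents the classifier keys on individual words, which illustrates why a small corpus produces brittle, error-prone policies.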
Data protection is one of the primary concerns when adopting cloud services. The average enterprise uses almost 2,000 cloud services, and employees often introduce new services on their own. Analyzing aggregated, anonymous event data from the cloud usage of 30 million users, McAfee found that 21% of documents uploaded to file sharing services contain sensitive information such as personally identifiable information (PII), protected health information (PHI), payment card data, or intellectual property, creating compliance concerns. It follows that employing the right cloud DLP solution, one that delivers accuracy, real-time monitoring, analysis of data in motion, incident remediation and data loss policy authoring, is essential for successful cloud adoption.