Cloud-based services have emerged for almost every facet of information storage, processing, and communication, and enterprises now use them across a wide range of use cases, including some that involve highly sensitive data such as sales tracking, customer management, payroll, and accounting. The rapid rate of enterprise cloud service adoption, coupled with the steep increase in security and data breaches (48% YoY according to IDG’s State of Security Report), increases the need to protect data in the cloud.

Gartner warns clients to prioritize cloud threat protection

Gartner now estimates that through 2020, 95% of cloud security breaches will be due to inappropriate or negligent usage. Given that a large proportion of the known breaches have emanated from users with valid login credentials, either as a compromised account or a malicious user, cloud threat protection has taken center stage.

In fact, in Gartner’s latest CASB report, How to Evaluate and Operate a Cloud Access Security Broker, the firm advises clients to “Make threat protection an integral CASB capability” and further states that “This is critical, because cloud services reside outside traditional enterprise security protection, such as intrusion prevention systems (IPSs) and anti-malware scanning” and that “phishing of employees and the resulting hijacking of endpoints and accounts, has become one of the most common causes of security failure in the enterprise.”

Monitoring all enterprise users (potentially tens or hundreds of thousands) across all of their cloud services presents a unique challenge that can only be addressed effectively with cloud-scale machine learning. Machine learning-enabled User and Entity Behavior Analysis (UEBA) is at the core of Skyhigh’s Threat Protection and is the answer to the limitations of traditional methods. In this blog we’ll lift the curtain to explain how Skyhigh leverages UEBA to detect advanced cloud security threats.

Background on shortcomings of traditional approaches

Traditional security solutions have relied either on heuristics (based on domain knowledge) or on static models (based on metrics computed from the data). Not only does the quality of a heuristic depend heavily on the security expert who wrote it, but heuristics also have limited effectiveness given how rapidly security breaches evolve. Eliciting subject matter expertise is as challenging as encoding it into deterministic rules.

Static models, on the other hand, place trust in the underlying data, but they are also not effective because they are, by design, tailored to an average user. A larger user base makes matters worse, as the nuances in how individuals use a cloud service become much more pronounced. Thresholds are used to identify anomalies without regard to the presence of sub-populations (non-traditional user groups) in the data, and unless it is explicitly captured, the interplay between outliers and thresholds also tends to be ignored. Traditional static models are typically designed to identify only point singularities, which limits the scope of ‘normal’ and makes them oblivious to the uniqueness of a user and the context.
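To make the sub-population problem concrete, here is a minimal sketch in Python. The download counts, the two user groups, and the `zscore_outliers` helper are all hypothetical and chosen only for illustration. A single threshold computed over the pooled population misses a value that is clearly unusual within its own group.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.5):
    """Return indices of values more than `threshold` standard deviations above the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if (v - mu) / sigma > threshold]

# Hypothetical daily download counts for two sub-populations.
analysts = [10, 12, 9, 11, 10, 13, 11, 10, 12, 95]   # 95 is unusual for an analyst
admins = [900, 950, 1000, 980, 940, 960, 1010, 990, 970, 930]

# One static threshold over everyone: the analyst's spike hides below the admins' normal level.
global_flags = zscore_outliers(analysts + admins)

# The same rule applied within the analyst group picks it up.
analyst_flags = zscore_outliers(analysts)

print(global_flags, analyst_flags)
```

The rule itself is unchanged between the two calls; only the population it is applied to differs, which is the point the paragraph above makes about thresholds that ignore user groups.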

The next frontier: User and Entity Behavioral Analysis (UEBA)

As discussed in the preceding blog of this series, Skyhigh’s Threat Protection begins by providing thorough visibility into who is accessing what data, across all enterprise-provisioned cloud services. Built-in activity monitoring analyzes petabytes of usage data from over 23 million users across a large number of cloud services to provide security teams with actionable intelligence. Machine learning-based models across a number of facets are used to identify anomalous activities.

The building blocks of modern UEBA, to which Skyhigh subscribes, are:

  1. User & User Group Behavioral Analysis
  2. User Group Identification
  3. Time Evolving Behavior Models
  4. Dynamic Self-Learning

Within the realm of cloud service anomaly detection, user behavior is defined as a composite of one or more of the following: service-specific actions, the content and nature of objects, data movement, number of times a service is accessed, rate of access, time of access, etc., measured either across a single cloud service or across a homogeneous group of cloud services.
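One way to picture this composite is as a small per-user, per-service record that accumulates the dimensions listed above. The sketch below is illustrative only: the `BehaviorProfile` class, its field names, and the sample events are assumptions, not a description of Skyhigh’s data model.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class BehaviorProfile:
    """Hypothetical per-user, per-service summary of observed usage."""
    user: str
    service: str
    action_counts: Counter = field(default_factory=Counter)  # service-specific actions
    bytes_moved: int = 0                                      # data movement
    active_hours: set = field(default_factory=set)            # time of access

    def record(self, action, hour, nbytes=0):
        """Fold one observed event into the profile."""
        self.action_counts[action] += 1
        self.bytes_moved += nbytes
        self.active_hours.add(hour)

    def access_count(self):
        """Total number of recorded actions."""
        return sum(self.action_counts.values())

profile = BehaviorProfile(user="alice", service="crm")
profile.record("login", hour=9)
profile.record("download", hour=9, nbytes=2048)
profile.record("download", hour=14, nbytes=1024)
print(profile.access_count(), profile.bytes_moved, sorted(profile.active_hours))
```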

In the context of an enterprise and one cloud service, individual user behavior can vary across multiple dimensions such as time of use, rate of use, aggregate use, level of use, etc. The variation may arise as much from personal preferences as from enterprise-enforced policies and practices. Without visibility into the corporate policies or an individual’s preferences, the only observable artifact is the actual usage. One can also assert that the observed usage is a manifestation of the hidden state of the user (for example, unobservable preferences and policies). The problem at hand is therefore that of modeling user behavior as a combination of unobserved and unmeasurable factors, where the combination varies from one user to the next.

Is time series analysis the answer?

It is reasonable to expect that an individual’s cloud service usage is different during different times of the day and week, and the usage can also change based on the phase of a project that one is working on. Usage patterns also tend to evolve over time.

Given that an individual’s behavior has a strong time reference, time series analysis of the user data becomes an obvious choice. It also lends itself well to modeling panel data (i.e., simultaneous modeling of behavior across different actions or cloud services without actually aggregating the data). Given a model for the user’s behavior, one can detect deviation from expected usage and potentially characterize it as anomalous.
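As a rough illustration of detecting deviation from expected usage, the sketch below builds a per-hour baseline for one user and checks a new observation against it. The function names, the history values, and the choice of a mean and standard deviation per hour of day are assumptions made for this example; a production model would be considerably richer.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: iterable of (hour_of_day, count). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    # Keep only hours with at least two samples, so stdev is defined.
    return {h: (mean(c), stdev(c)) for h, c in by_hour.items() if len(c) >= 2}

def is_anomalous(baseline, hour, count, k=3.0):
    """Flag a count that is more than k deviations away from that hour's expected level."""
    if hour not in baseline:
        return False  # not enough history to judge
    mu, sigma = baseline[hour]
    return abs(count - mu) > k * max(sigma, 1.0)  # floor sigma to avoid hair-trigger alerts

# Hypothetical history: busy at 10:00, nearly idle at 03:00.
history = [(10, c) for c in (40, 42, 38, 41, 39)] + [(3, c) for c in (0, 1, 0, 0, 1)]
baseline = build_baseline(history)

print(is_anomalous(baseline, hour=10, count=40))  # typical mid-morning volume
print(is_anomalous(baseline, hour=3, count=40))   # same volume at 03:00
```

The same count of 40 is unremarkable mid-morning and anomalous at 03:00, which is the sense in which the expected usage carries a time reference.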

Classical time series methods are restricted in many ways, primarily in the expectation that the underlying behavior is stationary or that the rate at which it changes is constant (as in a trend).

Circumventing the constraints of time series analysis

Skyhigh’s UEBA circumvents both constraints by using behavioral models with time-varying parameters, which capture the dynamic and evolving aspects of behavior. These models are further enhanced with separate components for local level (e.g., to represent department or regional patterns in usage), trend (e.g., to represent stages of adoption), seasonality (e.g., to represent usage patterns around holidays), exogenous variables (e.g., to represent external intelligence feeds), and stochastic noise (e.g., to represent unsegregated endpoint application beaconing).
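The simplest textbook model with a time-varying parameter is the local level model, in which the expected usage level is itself allowed to drift from one period to the next. The sketch below filters a series with a scalar Kalman filter and flags observations that fall far outside the one-step-ahead forecast. It is not Skyhigh’s model: it covers only the level component (no trend, seasonality, or exogenous inputs), and the variances `obs_var` and `level_var`, the threshold `k`, and the sample series are assumed values.

```python
import math

def flag_anomalies(observations, obs_var=4.0, level_var=1.0, k=3.0):
    """Local level model: y_t = level_t + noise, level_t = level_{t-1} + drift.

    Returns the indices of observations more than k forecast standard
    deviations away from the one-step-ahead prediction.
    """
    level, var = float(observations[0]), obs_var  # initialize from the first observation
    flagged = []
    for t, y in enumerate(observations[1:], start=1):
        pred_var = var + level_var                   # the level may have drifted since t-1
        forecast_sd = math.sqrt(pred_var + obs_var)  # uncertainty of the forecast for y
        if abs(y - level) > k * forecast_sd:
            flagged.append(t)
        gain = pred_var / (pred_var + obs_var)       # Kalman gain
        level += gain * (y - level)                  # move the level toward the observation
        var = (1.0 - gain) * pred_var
    return flagged

# Usage that grows steadily is tracked by the drifting level and not flagged;
# a sudden jump at the end is.
print(flag_anomalies([10, 11, 12, 13, 14, 15, 16, 60]))
print(flag_anomalies(list(range(10, 31))))
```

Because the level is re-estimated at every step, gradual changes in behavior are absorbed into the baseline rather than tripping a fixed threshold, while abrupt departures still stand out.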

Employing the 4 building blocks of UEBA: user & user group behavioral analysis, user group identification, time evolving behavior models, and dynamic self-learning

The behavioral models thus capture the heterogeneity in usage levels across different users, in addition to the nuances associated with each user-cloud service combination. Typical behavioral models also expect a large amount of training data, so confidence in a model is implicitly tied to the amount of historical usage data available. And as with most things, the 80-20 rule is observed in cloud service usage data, in two forms.

First, a large proportion of the usage data comes from only a small subset of the users. Second, a large proportion of the usage data relates to only a handful of actions (e.g., ‘Login’ is observed far more often than ‘Delete User’).
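This kind of concentration is easy to measure. The sketch below computes the share of total activity contributed by the most active fraction of users; the `top_share` helper and the event counts are made up for illustration, and the same calculation applies to actions instead of users.

```python
def top_share(counts, top_fraction=0.2):
    """Share of total activity contributed by the most active `top_fraction` of entities."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Hypothetical event counts for ten users.
events_per_user = [500, 300, 40, 30, 30, 25, 25, 20, 20, 10]
print(top_share(events_per_user))
```

With these numbers the two most active users account for 800 of 1,000 events, leaving very little data with which to model each of the remaining eight.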

Sparse usage data is therefore widespread, and is observed even when analyzing years of cloud service data. Skyhigh’s breakthrough came from leveraging a proprietary Bayesian algorithm to adaptively sample data to compensate for the lack of user-specific data. Skyhigh’s UEBA exploits the network effect of a large customer base, along with the fact that enterprise users in the same organizational structure or with the same job responsibilities tend to display comparable behaviors, to borrow confidence from models that have been built using an adequate amount of data. Combined with automatic group identification, this offers a practical approach to solving the sparse data challenge.
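Skyhigh’s algorithm is proprietary and is not described here, so the sketch below substitutes a generic, well-known technique to illustrate the idea of borrowing confidence from a peer group: Bayesian shrinkage with a Gamma prior on a Poisson event rate. The `shrunk_rate` function, the `prior_strength` parameter, and all of the numbers are assumptions for this example.

```python
def shrunk_rate(user_events, user_days, group_rate, prior_strength=10.0):
    """Posterior mean daily event rate for a user under a Gamma-Poisson model.

    The prior is centered on the peer group's rate and carries the weight of
    `prior_strength` pseudo-days of observation (an assumed tuning value).
    """
    prior_events = group_rate * prior_strength  # Gamma shape: pseudo-events
    prior_days = prior_strength                 # Gamma rate: pseudo-days
    return (prior_events + user_events) / (prior_days + user_days)

group_rate = 5.0  # hypothetical average daily events for the user's peer group

# A brand-new user with two quiet days stays close to the group expectation.
print(shrunk_rate(user_events=0, user_days=2, group_rate=group_rate))

# A long-observed user is governed almost entirely by their own history.
print(shrunk_rate(user_events=1000, user_days=1000, group_rate=group_rate))
```

The estimate leans on the group when a user has little history and shifts toward the user’s own data as evidence accumulates, which is one way to keep sparse users from producing unstable baselines.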

In the third and final installment of the “The Data Science Powering Skyhigh’s Cloud Threat Protection” series, we’ll provide concrete instances of cloud threats that were caught by employing data science and that would otherwise have gone unnoticed.