How to improve digital threat monitoring with machine learning

Article by Philip Tully, Head of Data Science Research at Mandiant.

Traditional cybersecurity defenses are designed to protect assets within an organization’s network, but those assets often extend beyond the network perimeter, increasing the risk of exposure, theft, and financial loss.

As part of a comprehensive digital risk protection solution, a Digital Threat Monitoring (DTM) solution automatically collects and analyzes content streamed from external sources online, then alerts defenders whenever a potential threat is detected.

This capability enables organizations to expose threats earlier and more effectively identify potential vulnerabilities and exposures before they escalate, without adding operational complexity for already overstretched security teams.

But what is digital threat monitoring and why is it so hard to get it right?

A recently released DTM module alerts customers to threats emanating from social media, the deep and dark web, paste sites, and other online channels. An organization can use this module to monitor and gain visibility into digital threats that target its assets in real time, either directly or indirectly.

Advanced DTM can also surface pivot points for additional enrichment, context, or threat hunting. DTM supports a wide variety of use cases:

  • As a threat intelligence analyst, I want to uncover threat actors actively targeting our infrastructure so that I can prioritize defenses and remediation.
  • As a CISO, I need to identify threats to our suppliers and our supply chain so that I can proactively mitigate that risk.
  • As a threat hunter, I want to identify possible data leaks and breaches so that I can uncover attackers in our environment and minimize their dwell time.

DTM is an ongoing process: data collection, content analysis, alerting, remediation and takedown, and subsequent search refinement feed back into collection in a continuous loop. A DTM program must continually evolve to allow organizations to stay proactive in the face of digital threats.
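As a rough illustration of that loop, the sketch below models each stage as a simple function. Every name here (collect_documents, analyze_content, and so on) is a hypothetical placeholder, not part of any actual DTM API.

```python
# Minimal sketch of the DTM feedback loop described above.
# All functions are hypothetical placeholders, not a real DTM API.

def collect_documents(queries):
    # Stand-in for pulling content from social media, paste sites, dark web forums, etc.
    return [{"text": f"document matching '{q}'", "source": "forum"} for q in queries]

def analyze_content(docs):
    # Stand-in for the NLP pipeline: entity extraction and topic classification.
    return [{**d, "entities": [], "topics": ["information_security"]} for d in docs]

def alert_and_remediate(enriched):
    # Stand-in for alerting defenders and triggering takedown/remediation workflows.
    return [d for d in enriched if "information_security" in d["topics"]]

def refine_queries(alerts, queries):
    # Stand-in for tuning watch rules based on analyst feedback on the alerts.
    return queries  # unchanged in this toy example

queries = ["acme-corp credentials"]
for _ in range(3):  # each iteration is one pass through the loop
    docs = collect_documents(queries)
    enriched = analyze_content(docs)
    alerts = alert_and_remediate(enriched)
    queries = refine_queries(alerts, queries)
```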

In addition to the dynamically changing nature of ingested content and the threat landscape itself, the diversity of ingested sources presents another significant technical challenge. While a customer wants a seamless and consistent end-to-end experience for each new source hooked up through DTM, documents derived from different sources can vary widely in structure, semantic composition, language, and length.

Older solutions mainly rely on keyword matching to solve the problems described above. However, individual keywords can match documents in a variety of irrelevant contexts. Additionally, keyword matching is a fragile signature-based approach that inevitably fails to recognize new entities and threats as they evolve.

Worse still, trying to define complex threat concepts like credential dumps or the circulation of new exploits using simple keyword combinations can be practically impossible. Often this results in huge, unmanageable watch rules containing hundreds or thousands of unrelated keywords.
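To make that fragility concrete, the toy sketch below applies naive keyword matching to two documents. The keyword list and documents are invented purely for illustration.

```python
# Toy illustration of why bare keyword matching generates noise.
# The keywords and documents are invented for this example.

keywords = {"dump", "exploit", "apple"}

documents = [
    "Threat actor offers a fresh credential dump and a working exploit for sale",
    "Grandma's apple pie recipe: dump the filling into the crust and bake",
]

for doc in documents:
    hits = {kw for kw in keywords if kw in doc.lower()}
    if hits:
        # Both documents trigger an alert, even though only the first is a threat.
        print(f"ALERT ({', '.join(sorted(hits))}): {doc}")
```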

Given these challenges, it is essential to adopt a data-driven approach using machine learning to extract valuable insights and present them in a user-friendly way.

The latest DTM modules leverage machine learning (ML) and natural language processing (NLP) to continuously analyze and extract actionable patterns from millions of documents every day. This allows DTM customers to create custom watch rules to quickly identify the content that matters most to their organization.

DTM is underpinned by seven machine learning models that have been implemented, evaluated, and deployed in production. Together, these form an end-to-end, cloud-based NLP pipeline that enriches ingested documents with entity extractions and classifications.

This allows customers to easily query proprietary data stores and customize alerts based on what matters most to them. From a technical point of view, this architecture also delivers immediate benefits, including the ability to:

  • Measurably reduce false positives and improve the quality of the alerts analysts handle
  • Scale horizontally to handle arbitrary increases in document volume
  • Quickly capture errors and feedback so the models can be iterated on rapidly
  • Expose the entities and classifications produced by individual models to feed aggregate views and historical trends
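To make the architecture more concrete, here is a rough sketch of a document-enrichment pipeline that chains model stages over a stream of documents. The stage names and enrichment fields are assumptions for illustration, not the actual DTM pipeline.

```python
# Hedged sketch of a document-enrichment pipeline: each stage adds fields to a
# document dict. Stage names and fields are illustrative assumptions only.
from typing import Callable, Dict, Iterable, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]

def extract_entities(doc: Document) -> Document:
    # Placeholder for a named-entity-recognition model.
    doc["entities"] = [{"text": "Apple", "type": "organization"}]
    return doc

def classify_topics(doc: Document) -> Document:
    # Placeholder for a topic/industry classification model.
    doc["topics"] = ["information_security"]
    return doc

PIPELINE: List[Stage] = [extract_entities, classify_topics]

def enrich(docs: Iterable[Document]) -> List[Document]:
    enriched = []
    for doc in docs:
        for stage in PIPELINE:
            doc = stage(doc)
        enriched.append(doc)
    return enriched

print(enrich([{"text": "New Apple supply chain vulnerability discussed on a forum"}]))
```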

Advanced NLP techniques based on neural networks have been incorporated into the development of the individual machine learning models that make up the pipeline. State-of-the-art transformer neural networks have been applied to security tasks such as detecting social media information operations, malicious URLs, and even malicious binaries.

Transformers learn context in parallel by tracking long-range relationships between elements of sequential data, such as the words in a document. This lets them outperform the previous generation of models, which processed words within a limited window and made more errors when related words occurred far apart.
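As a minimal sketch of what that means in practice, the snippet below runs a small off-the-shelf encoder (distilbert-base-uncased from the Hugging Face transformers library, chosen only for illustration; the article does not name the models DTM uses) and shows that every token's embedding is computed with attention over the whole input.

```python
# Minimal sketch: a transformer encodes every token with attention over the
# entire input, so the embedding of "dump" is conditioned on distant context.
# The model choice is illustrative; the article does not name DTM's models.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

text = (
    "The actor claims the credential dump was pulled from a compromised "
    "vendor portal and is offering samples to prove it."
)
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)

# One contextual vector per token, each informed by the whole sequence.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, num_tokens, 768])
```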

Additionally, a new semi-supervised topic classifier combines subject matter expert knowledge with a data-driven ML approach to identify high-level threat topics in every document. High levels of accuracy and noise reduction have been achieved using transformer models and topic modeling.
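The article does not describe how that semi-supervised classifier is built, but one common way to blend expert knowledge with a pretrained transformer is zero-shot classification over analyst-defined topic labels. The sketch below uses Hugging Face's zero-shot pipeline with an invented label set, purely as an analogy.

```python
# Illustrative only: zero-shot classification with analyst-defined topic labels.
# This is one way to blend expert knowledge with a pretrained transformer; it is
# not a description of DTM's actual semi-supervised topic classifier.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

expert_topics = ["compromised credentials", "exploit sales", "cooking"]  # SME-chosen labels

doc = "Fresh combo list with 2M email:password pairs, samples on request."
result = classifier(doc, candidate_labels=expert_topics)

# The highest-scoring label becomes the document's topic tag.
print(result["labels"][0], round(result["scores"][0], 3))
```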

Benefits

The high accuracy of the pipeline’s machine learning models translates into a better experience for customers using DTM. Because entity types are extracted from ingested documents, organizations searching for supply chain vulnerabilities affecting Apple products need not scroll through noisy documents mentioning apple pie recipes.

Entities help customers cut through the noise in large volumes of documents. The pipeline currently supports over 40 distinct entity types, with more planned, giving customers a rich set of accurately detected entities from which to build precise monitors and be alerted only to the most relevant information.
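To illustrate the idea of entity-aware filtering (using spaCy's open-source NER model rather than DTM's own extractors, which are not public), the sketch below keeps only documents where "Apple" appears as an organization.

```python
# Entity-aware filtering sketch using spaCy's open-source NER model; DTM's own
# entity extractors are not public, so this only illustrates the concept.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

documents = [
    "Researchers discuss a supply chain vulnerability affecting Apple devices.",
    "This apple pie recipe has been in the family for generations.",
]

for text in documents:
    doc = nlp(text)
    orgs = {ent.text for ent in doc.ents if ent.label_ == "ORG"}
    if "Apple" in orgs:
        print("Keep:", text)  # 'Apple' recognized as an organization
    else:
        print("Drop:", text)  # 'apple' the fruit is not tagged as an organization
```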

Finally, machine learning simplifies the creation of monitors by allowing customers to filter documents by high-level topics. Documents flowing through the NLP analytics pipeline are tagged with up to 40 threat topic or industry tags, allowing customers to tailor the alerts they receive to common threats and categorized security-related content, or to content pertaining specifically to their industry vertical.

Topics provide another way for DTM customers to refine their alerts beyond simple keyword matching. For example, incoming material about life hacks or growth hacking is filtered out when a monitor condition specifies that documents must be associated with the information security/compromise topic.
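As a rough sketch of such a monitor condition (the rule format below is invented for illustration and is not DTM's actual monitor syntax), a keyword can be combined with a required topic tag so that only topically relevant matches fire an alert.

```python
# Invented rule format for illustration; not DTM's actual monitor syntax.
monitor = {
    "keywords": ["hack"],
    "required_topics": ["information_security/compromise"],
}

documents = [
    {"text": "10 life hacks to organize your kitchen", "topics": ["lifestyle"]},
    {"text": "Forum post selling access to a hacked retail network",
     "topics": ["information_security/compromise"]},
]

def matches(doc, rule):
    has_keyword = any(kw in doc["text"].lower() for kw in rule["keywords"])
    has_topic = any(t in doc["topics"] for t in rule["required_topics"])
    return has_keyword and has_topic

for doc in documents:
    # Only the compromise-related document triggers an alert, despite both matching "hack".
    print("ALERT" if matches(doc, monitor) else "ignore", "-", doc["text"])
```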

DTM has undergone rigorous internal evaluation, so users can be confident that the entities and classifications from which monitors are built reflect state-of-the-art NLP and threat intelligence capabilities.
