Anticipating new spam domains with machine learning
French researchers have developed a method to identify newly registered domains that are likely to be used in a “hit and run” fashion by high-volume email spammers, sometimes even before the spammers have sent a single unwanted email.
The technique is based on analyzing how the Sender Policy Framework (SPF), a method of verifying the provenance of emails, has been implemented on newly registered domains.
Through passive Domain Name System (DNS) sensors, the researchers obtained near real-time DNS data from Seattle-based Farsight, which yielded SPF activity in TXT records for a range of domains.
Using a class-weighting algorithm originally designed to deal with unbalanced medical data and implemented in the machine-learning Python library scikit-learn, the researchers were able to detect three-quarters of pending spam domains within moments, or even before they became operational.
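The article does not give the paper's exact classifier settings. As a minimal sketch of what class weighting does, the snippet below computes the "balanced" heuristic that scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class_count), in plain Python. The label counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical, deliberately unbalanced labels: 1 = spam domain, 0 = benign.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

The rare class receives the larger weight, so misclassifying a spam domain costs the model more than misclassifying one of the far more numerous benign domains.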
The paper states:
“With a single request to the TXT record, we detect 75% of spam domains, possibly before the spam campaign begins. Thus, our system provides an important reaction speed: we can detect spammers with good performance even before a mail is sent and before a peak in DNS traffic.”
The researchers say the features used in their technique could be added to existing spam detection systems to increase performance without adding significant computational overhead, since the system relies on SPF data passively inferred from near real-time DNS feeds that are already used for other approaches to the problem.
The paper is titled Early Detection of Spam Domains with Passive DNS and SPF, and comes from three researchers at the University of Grenoble.
SPF is designed to prevent email address spoofing, by verifying that a registered and authorized IP address has been used to send email.
Other email verification methods include DomainKeys Identified Mail (DKIM) signatures and Domain-based Message Authentication, Reporting, and Conformance (DMARC).
All three methods must be registered as TXT records (configuration settings) with the domain registrar for the authentic sending domain.
Spam and burning
Spammers exhibit “signature behavior” in this regard. Their intention (or, at least, the collateral effect of their activities) is to “burn” the reputation of the domain and its IP addresses by blasting out mail en masse until either the network providers selling these services take action, or the associated IP addresses are registered in popular spam filter lists, making them useless to the current sender (and problematic for future owners of the IP addresses).
Once a domain is no longer viable, spammers move on to other domains and services as needed, repeating the process with new IP addresses and configurations.
Data and methods
The domains studied for the research cover the period between May and August 2021, as reported by Farsight. Only newly registered domains were considered, since this matches the modus operandi of the persistent spammer.
The domain list was constructed using data from ICANN’s Central Zone Data Service (CZDS). Blacklist information from the SURBL and SpamHaus projects was used to identify potentially problematic new domain registrations in near real time, although the authors admit that the imperfect nature of spam lists can lead to benign domains being accidentally classified as potential sources of bulk mail.
After capturing DNS TXT queries to newly registered domains found in the passive DNS stream, only queries with valid SPF data were retained, providing ground truth for the algorithms.
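The filtering step above can be sketched roughly as follows. This is only an assumption about how such a filter might look: it checks the SPF version tag (`v=spf1` followed by a space or the end of the record) and does not validate the rest of the record. The domain names and TXT values are hypothetical.

```python
def is_spf_record(txt):
    """True if a TXT string starts with the SPF version tag 'v=spf1'."""
    value = txt.strip().lower()
    return value == "v=spf1" or value.startswith("v=spf1 ")


# Hypothetical (domain, TXT value) pairs as they might appear in a passive DNS feed.
observations = [
    ("shop.example", "v=spf1 include:_spf.example.net -all"),
    ("news.example", "google-site-verification=abc123"),
    ("mail.example", "v=spf1 ptr -all"),
    ("bad.example", "v=spf10 -all"),
]

# Keep only the observations that carry an SPF record.
spf_only = [(domain, txt) for domain, txt in observations if is_spf_record(txt)]
```

TXT records are used for many unrelated purposes (site verification tokens, DKIM keys and so on), which is why a version-tag check of this kind is needed before any SPF-specific analysis.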
SPF has a number of usable features; the new paper found that while “harmless” domain owners most often use the +include mechanism, spammers make the heaviest use of the (now obsolete) +ptr mechanism.
A +ptr lookup takes the IP address of the sending mail server and checks for any existing record that associates that IP address with a hostname (e.g. one at GoDaddy). If a hostname is found, its domain is compared to the domain that was first used to reference the SPF record.
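As an illustration of the check just described, here is a minimal sketch that follows the general shape of the ptr mechanism in the SPF specification: reverse-resolve the IP, forward-confirm each hostname, then compare it to the target domain. Real DNS lookups are replaced by two hypothetical in-memory tables (`reverse_dns` and `forward_dns`), and the addresses come from documentation ranges.

```python
def ptr_matches(ip, target_domain, reverse_dns, forward_dns):
    """Sketch of an SPF ptr check using in-memory lookup tables.

    reverse_dns maps IP -> list of hostnames (PTR records);
    forward_dns maps lowercase hostname -> list of IPs (A/AAAA records).
    """
    target = target_domain.lower().rstrip(".")
    for host in reverse_dns.get(ip, []):
        name = host.lower().rstrip(".")
        # Forward-confirm: the hostname must resolve back to the same IP.
        if ip not in forward_dns.get(name, []):
            continue
        # Match when the confirmed hostname is the target domain or a subdomain of it.
        if name == target or name.endswith("." + target):
            return True
    return False


# Hypothetical lookup tables standing in for a real resolver.
reverse_dns = {
    "192.0.2.10": ["mail.example.com."],
    "198.51.100.7": ["host.other.example"],
    "203.0.113.5": ["mail.example.com"],  # claims the name, but is not confirmed below
}
forward_dns = {
    "mail.example.com": ["192.0.2.10"],
    "host.other.example": ["198.51.100.7"],
}

confirmed = ptr_matches("192.0.2.10", "example.com", reverse_dns, forward_dns)
other_domain = ptr_matches("198.51.100.7", "example.com", reverse_dns, forward_dns)
unconfirmed = ptr_matches("203.0.113.5", "example.com", reverse_dns, forward_dns)
```

Every evaluation needs at least one reverse lookup plus a forward lookup per returned hostname, which gives a sense of why the check is expensive at scale.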
Spammers can exploit the apparent rigor of +ptr to present themselves in a more credible light, when in fact the resources required to perform +ptr lookups at scale force many providers to skip the check altogether.
In short, the way spammers use SPF to secure a window of opportunity before blast and burn begins represents a characteristic signature that can be inferred by machine analysis.
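To make the idea of a machine-readable signature concrete, the following sketch turns an SPF record into simple per-mechanism counts. The feature set here is an assumption chosen for illustration, not the paper's actual feature list, and the example record is hypothetical.

```python
def spf_features(record):
    """Count SPF mechanisms in a record; qualifiers (+, -, ~, ?) are stripped."""
    counts = {"include": 0, "ptr": 0, "ip4": 0, "ip6": 0, "a": 0, "mx": 0, "all": 0}
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        term = term.lstrip("+-~?").lower()
        # The mechanism name ends at the first ':' or '/'.
        # Modifiers such as "redirect=..." never match a key and are ignored.
        name = term.split(":", 1)[0].split("/", 1)[0]
        if name in counts:
            counts[name] += 1
    return counts


# Hypothetical record mixing the mechanisms discussed above.
features = spf_features("v=spf1 +include:_spf.example.net ip4:192.0.2.0/24 ptr -all")
```

A vector of counts like this can be fed to any standard classifier; a record dominated by `ptr` would look different from one dominated by `include`.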
Since spammers often move to closely neighboring ranges of IP addresses and resources, the researchers developed a relationship graph to explore the correlation between IP address ranges and domains. The graph can be updated in near real time in response to new data from SpamHaus and other sources, becoming more useful and comprehensive over time.
The researchers state:
“Studying these structures can highlight potential spam domains. In our dataset, we found [structures] in which dozens of domains used the same [SPF] rule and the majority of them were on spam blacklists. As such, it is reasonable to assume that the remaining domains have probably not yet been detected or are not active spam domains yet.”
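The reasoning in that quote can be sketched in a much simplified form. The paper builds a graph linking domains and IP ranges; the snippet below only groups domains by identical SPF rule and flags the unlisted members of groups where a strict majority is already blacklisted. The function name, the `min_size` threshold and all data are hypothetical.

```python
from collections import defaultdict


def suspicious_by_shared_rule(spf_by_domain, blacklisted, min_size=3):
    """Return unlisted domains whose exact SPF rule is shared mostly by blacklisted ones."""
    groups = defaultdict(set)
    for domain, rule in spf_by_domain.items():
        groups[rule.strip().lower()].add(domain)

    suspects = set()
    for domains in groups.values():
        listed = domains & blacklisted
        # Flag the rest only if a strict majority of a large enough group is blacklisted.
        if len(domains) >= min_size and len(listed) * 2 > len(domains):
            suspects |= domains - listed
    return suspects


# Hypothetical data: four domains share one rule, three of them are blacklisted.
shared_rule = "v=spf1 ip4:192.0.2.0/24 -all"
spf_by_domain = {
    "a.example": shared_rule,
    "b.example": shared_rule,
    "c.example": shared_rule,
    "d.example": shared_rule,
    "e.example": "v=spf1 include:_spf.example.net ~all",
}
blacklisted = {"a.example", "b.example", "c.example"}

suspects = suspicious_by_shared_rule(spf_by_domain, blacklisted)
```

Exact string equality is a crude stand-in for the graph relationships described in the article; overlapping IP ranges or shared include targets would need a richer structure.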
The researchers compared the spam domain detection latency of their approach to SpamHaus and SURBL over a 50-hour period. They report that for 70% of identified spam domains, their own system was faster, while admitting that 26% of identified spam domains appeared in commercial blacklists within the next hour. 30% of domains were already blacklisted when they appeared in the passive DNS feed.
The authors claim an F1 score of 79% against ground truth based on a single DNS query, while competing methods such as Exposure may require a week of preliminary analysis.
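For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The short sketch below shows the calculation; the counts are purely illustrative and are not the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 spam domains caught, 2 false alarms, 2 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Because F1 ignores true negatives, it suits problems like this one, where benign domains vastly outnumber spam domains.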
“Our scheme can be applied to the early stages of a domain’s lifecycle: using passive (or active) DNS, we can get SPF rules for newly registered domains and classify them immediately, or wait until we detect TXT queries to that domain and refine the classification using hard-to-evade temporal features.”
And continue:
“[Our] best classifier detects 85% of spam domains while maintaining a false positive rate of less than 1%. The detection results are remarkable given that the classification only uses the contents of the domain’s SPF rules and their relationships, and hard-to-evade features based on DNS traffic.
“The performance of the classifiers remains high, even if they only receive the static features that can be gathered from a single TXT request (either passively observed or actively queried).”
First published May 5, 2022.