URL Analysis 101: Automating Phishing Investigations with Machine Learning

Analyzing suspicious URLs one at a time can be tricky, but when you face a large volume of potentially malicious URLs, approaches that leverage automation (such as machine learning) become critical.

At Intezer, we recently launched a URL scanning feature that detects phishing and other malicious URLs. It integrates with services such as URLscan and APIVoid, and we are also adding tools built in-house. We built our URL scanner to support both automated use cases (adding a deeper layer of automated triage and analysis to phishing investigation pipelines with thousands of suspicious URLs) and “manual” analysis of individual URLs discovered during investigations.

This blog is the second in our URL Analysis 101 series. In the previous part, we discussed how attackers can modify URLs to exploit human and browser weaknesses and successfully “phish” people who click on them. In this part, we discuss automatic detection of malicious URLs and some of the reasoning behind the various methods, with an emphasis on the popular use of machine learning for this task.

Motivation for Using Machine Learning

To understand the motivation behind this blog, we must first understand the general methods of automatic URL detection.

In general, two methods are discussed:

  1. List-based – Blocklisting or allowlisting of websites. Examples of these kinds of lists are:

  • URLHaus
  • Phishstats
  • Alexa rank (a high rank lends credibility to the trustworthiness of the website)

  2. Machine learning-based – Using labeled URLs and features extracted from those URLs to train ML models.

Pros and Cons of List-Based and Machine Learning Methods

List-based methods

  • Pros: These methods are very precise: a URL that appears in a blocklist is malicious with very high probability, so list-based detection generates a very low rate of false positives.
  • Cons: List-based methods cannot detect new phishing websites, because any URL that is not yet on the list will not be identified.
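
To make this trade-off concrete, here is a minimal sketch of a list-based check in Python. The host names and the `is_blocklisted` helper are made up for illustration, and real feeds may match on full URLs rather than hosts. The point is only that a lookup can flag entries it already knows about and nothing else.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries, invented for this illustration.
BLOCKLISTED_HOSTS = {"login-update.example", "secure-pay.example"}


def is_blocklisted(url: str) -> bool:
    """Return True only if the URL's host is an exact blocklist entry."""
    host = urlsplit(url).hostname or ""
    return host in BLOCKLISTED_HOSTS


# A host that is already on the list is flagged.
print(is_blocklisted("http://login-update.example/reset"))
# A slightly different, previously unseen host slips through.
print(is_blocklisted("http://login-update2.example/reset"))
```

The second call returns False even though the URL looks nearly identical to a listed one, which is the gap that machine learning methods try to close.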

Machine learning-based methods

  • Pros: ML methods are able to detect new malicious URLs based on characteristics of the URL and/or the web page.
  • Cons: ML methods will generally generate more false positives than list-based methods.

Machine Learning: A Small Intro

Don’t panic! There is no math and nothing very complicated in this section.

Please note that we are not going to give a full machine learning solution to the problem, but merely provide some intuition for a machine learning approach.

For a machine learning model to learn the differences between phishing and trusted URLs, we need to provide it with labeled data: in our case, a “trusted” or “phishing” label for each URL.
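
As a rough sketch of what labeled data can look like, the Python snippet below pairs each URL with its label and turns it into the two features used in this post's example (URL length and number of digits). The `LabeledUrl` type, the `extract_features` helper, and the sample URLs are assumptions made for illustration, not part of any real dataset.

```python
from typing import List, NamedTuple, Tuple


class LabeledUrl(NamedTuple):
    url: str
    label: str  # either "trusted" or "phishing"


def extract_features(url: str) -> Tuple[int, int]:
    """Return (length, digit_count) for a URL string."""
    return len(url), sum(ch.isdigit() for ch in url)


# Made-up examples; a real training set would contain many labeled URLs.
dataset: List[LabeledUrl] = [
    LabeledUrl("https://example.com/docs", "trusted"),
    LabeledUrl("http://example.test/a1b2c3d4e5/login9876", "phishing"),
]

X = [extract_features(item.url) for item in dataset]  # feature vectors
y = [item.label for item in dataset]                  # labels to learn

for features, label in zip(X, y):
    print(features, label)
```

A supervised model is trained on the pairs in `X` and `y`; without `y`, the same feature vectors could only be used for unsupervised learning.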

When we have labeled data, the machine learning process is called supervised learning, as opposed to unsupervised learning, where the dataset does not include the labels we want to learn.

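In the plot for this example, each URL is a point described by two features: its length and its number of digits. The decision boundary shown in the plot was determined by a machine learning algorithm known as a support vector machine (SVM).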

Using this simple model, we can visually divide the plane into two regions: below the classification line is the trusted URL area, and above is the phishing URL area.

For instance, a new URL entering this classifier with a length of 50 and 20 digits would be classified as a phishing URL, while a URL with a length of 50 and fewer than 5 digits would be classified as trusted.
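
The idea of classifying by position relative to a line can be sketched in a few lines of Python. The text does not state the actual line the model learned, so `SLOPE` and `INTERCEPT` below are hypothetical values chosen only to reproduce the two examples above. The sketch also assumes that the number of digits is on the vertical axis.

```python
# Hypothetical boundary: digits = SLOPE * length + INTERCEPT.
# These values are assumptions, not the parameters of the trained model.
SLOPE = 0.1
INTERCEPT = 5.0


def classify(length: int, digits: int) -> str:
    """Label a URL by which side of the boundary line its point falls on."""
    boundary = SLOPE * length + INTERCEPT
    return "phishing" if digits > boundary else "trusted"


print(classify(50, 20))  # above the line
print(classify(50, 4))   # below the line
```

Training a model amounts to choosing the slope and intercept from the labeled data instead of setting them by hand.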

It is important to note that with only two features (length and number of digits), the model produces some false negatives: 3 phishing URLs are mistakenly classified as trusted URLs.
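
Counting these errors is straightforward once true labels and predictions sit side by side. The sketch below uses made-up labels, arranged to mirror the 3 false negatives mentioned above. The `count_errors` helper is hypothetical.

```python
from typing import Dict, Sequence


def count_errors(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, int]:
    """Count false negatives and false positives, with 'phishing' as the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        # Phishing URLs that the model let through as trusted.
        "false_negatives": sum(t == "phishing" and p == "trusted" for t, p in pairs),
        # Trusted URLs that the model wrongly flagged as phishing.
        "false_positives": sum(t == "trusted" and p == "phishing" for t, p in pairs),
    }


# Illustrative labels only, not results from a real model.
y_true = ["phishing"] * 5 + ["trusted"] * 5
y_pred = ["phishing", "phishing", "trusted", "trusted", "trusted"] + ["trusted"] * 5

print(count_errors(y_true, y_pred))
```

Adding more informative features is one common way to reduce such misclassifications.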

Automating URL Analysis

Large organizations are increasingly turning to automation to streamline high-volume phishing investigation pipelines, using methods such as machine learning for URL detection. In this blog, we explored how machine learning can automate URL detection and discussed the pros and cons of machine learning versus list-based detection methods.

We hope you found this blog informative. If you have any further questions about our products or URL analysis tools, please don’t hesitate to reach out.

Create a free Intezer account here – sign up for a 14-day trial to start scanning URLs, suspicious QR codes, and analyzing dropped malware now.


Daniel Pienica

Daniel is a data scientist at Intezer.
