A Deep Dive into Phishing URL Detection

CRAC Learning
Jan 11

In today's digital age, cyber security is more critical than ever. One of the most prevalent threats is phishing, where attackers deceive individuals into revealing sensitive information such as passwords, banking details, or personal identification by posing as legitimate entities. A primary tool used by these attackers is phishing URLs, which lead unsuspecting users to malicious websites. Detecting these URLs before they cause harm is essential in protecting individuals and organisations alike.

Overview

Phishing URL detection using machine learning has emerged as an effective solution to this problem. Unlike traditional methods, such as blacklisting URLs, machine learning algorithms can dynamically analyze patterns in URLs and identify potential phishing attempts even before they’ve been reported.

During my internship at CRAC Learning, I worked on a critical cybersecurity project: Phishing URL Detection. The goal was to train a machine learning model that could accurately identify phishing URLs by extracting meaningful patterns from them. This blog provides a technical overview of the project and showcases some of the most important features used to build a successful detection model.

Feature Extraction

When detecting phishing URLs, each feature plays a crucial role in distinguishing legitimate websites from malicious ones. Here’s a breakdown of the importance and need for each feature:

1. Domain Age

Importance: Phishing websites are often short-lived, as attackers use them for quick, targeted attacks before they get blacklisted or shut down. By looking at the domain age, we can detect new or recently created domains, which are suspicious.
Need: Legitimate businesses usually have well-established domains that have been registered for several years. If a domain is younger than a specific threshold (e.g., 40 months), it is flagged as suspicious because it might indicate a newly registered phishing site.

2. DNS Record

Importance: DNS (Domain Name System) records map a domain name to the servers behind it, so they are a useful signal of whether a domain is properly set up and maintained. Phishing URLs often lack valid DNS records because they are quickly set up and discarded once their purpose is served.
Need: A valid and active DNS record typically indicates a legitimate website, while an invalid or missing DNS record suggests that the URL might be associated with a phishing attack.

3. Shortened URL

Importance: URL shortening services (like bit[.]ly or goo[.]gl) are frequently used by attackers to obscure malicious URLs. This hides the true destination of the URL, making it harder for users to recognize phishing attempts.
Need: Detecting shortened URLs helps uncover phishing attempts masked behind legitimate-looking short links, providing an additional layer of protection.

4. Use of IP Address in URL

Importance: Legitimate websites generally use domain names, whereas phishing websites sometimes use raw IP addresses to avoid detection or bypass domain-related checks.
Need: The presence of an IP address in a URL is highly suspicious and often indicates a phishing attempt, especially if combined with other red flags.

5. URL Length

Importance: Phishing URLs are often longer than legitimate URLs. Attackers may pad a URL with extra subdomains or parameters so that the real destination is buried and the link looks legitimate at a glance.
Need: A URL length exceeding a certain threshold (e.g., 60 characters) raises suspicion, as longer URLs can be used to hide malicious content.

6. HTTPS Presence

Importance: While HTTPS is a sign of security, its absence can indicate a lack of effort to secure the website, which is common in phishing sites.
Need: Websites without HTTPS are more likely to be phishing sites since legitimate websites increasingly use HTTPS to encrypt data. However, this feature needs to be used in conjunction with others, as some phishing websites may also use HTTPS.

7. Suspicious Words in URL

Importance: Attackers often use familiar words like “login,” “bank,” “account,” “secure,” or “verify” to trick users into believing they are accessing a trusted site.
Need: Identifying URLs with suspicious words can help flag sites trying to impersonate financial institutions, e-commerce platforms, or email providers—common targets for phishing attacks.

8. Redirects

Importance: Phishing websites frequently use multiple redirects to confuse users or evade detection by security systems.
Need: Tracking the number of redirects a URL follows is crucial, as legitimate websites typically don’t use many redirects. A high number of redirects can indicate an attempt to obfuscate the final destination, which could be a phishing site.

Each of these features, when analyzed together, significantly enhances the ability to detect phishing URLs. By understanding and incorporating these characteristics into a machine learning model, we can create robust systems that protect users from falling victim to phishing attacks.

Implementation

Step 1: Load required libraries and import URL dataset
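
The post does not show this step, so here is a minimal sketch of what it could look like. It assumes the dataset is a CSV file with a `url` column (a hypothetical name) and the `status` label column used later in the training step. The function name `load_url_dataset` and the file name in the usage comment are illustrative only.

```python
import pandas as pd


def load_url_dataset(csv_path):
    """Load a labelled URL dataset and do some basic cleaning.

    Assumes (hypothetically) a 'url' column holding the raw URL and a
    'status' column holding the phishing/legitimate label.
    """
    df = pd.read_csv(csv_path)

    # Fail early if the expected columns are not present
    missing = {"url", "status"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    # Drop rows without a URL or label, then drop repeated URLs
    df = df.dropna(subset=["url", "status"]).drop_duplicates(subset="url")
    return df.reset_index(drop=True)


# Usage (hypothetical file name):
# df = load_url_dataset("phishing_urls.csv")
```

Removing duplicate URLs before splitting helps keep the same URL from landing in both the training and the test set.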

Step 2: Feature Engineering

To build a robust model, we engineered certain key features based on patterns observed in phishing URLs:

Domain Age

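from whois import whois  # assumes the python-whois package

def domain_age(domain):
    """Return 1 (suspicious) if the domain spans fewer than 40 months, else 0."""
    try:
        whois_info = whois(domain)
        creation_date = whois_info.creation_date
        expiration_date = whois_info.expiration_date
        # Days between creation and expiry, i.e. the registered lifetime,
        # which this function uses as its measure of domain age
        age_of_domain = abs((expiration_date - creation_date).days)
        return 1 if (age_of_domain / 30) < 40 else 0
    except Exception:
        # A failed or incomplete WHOIS lookup is treated as suspicious
        return 1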

DNS Record

def dns_record_check(dns_record):
    # 1 = suspicious (no active DNS record), 0 = legitimate
    return 1 if dns_record == 'Inactive' else 0

Shortened URL
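
The code for this feature is not shown in the post. A minimal sketch, assuming the check is a simple lookup of the URL's host against a list of known shortening services, might look like the following. The host list is a small illustrative sample (bit.ly and goo.gl come from the feature description above; the rest are assumptions), and the function name mirrors the `shortening_service` column used in the training step.

```python
from urllib.parse import urlsplit

# Illustrative, non-exhaustive sample of shortener hosts (an assumption,
# not the list used in the project)
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly", "is.gd"}


def shortening_service(url):
    """Return 1 if the URL's host is a known shortening service, else 0."""
    # urlsplit only recognises the host when a scheme is present
    if "://" not in url:
        url = "http://" + url
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        return 0  # malformed URL: no host to compare
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENER_HOSTS else 0
```

Matching on the parsed host rather than searching the whole string avoids flagging a URL that merely mentions a shortener somewhere in its path.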

URL Length
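
The code for this feature is not shown in the post. A minimal sketch, assuming the 60-character threshold mentioned in the feature description and the same 1 = suspicious convention as the other functions, could be:

```python
def url_length(url, max_length=60):
    """Return 1 (suspicious) if the URL is longer than max_length, else 0.

    The default of 60 characters follows the example threshold in the
    text; the parameter name max_length is an illustrative choice.
    """
    return 1 if len(url) > max_length else 0
```

Keeping the threshold as a parameter makes it easy to tune against the dataset instead of hard-coding one value.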

Use of IP Address
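
The code for this feature is not shown in the post. One possible sketch, assuming the check is whether the host part of the URL parses as a literal IPv4 or IPv6 address, uses the standard `ipaddress` module. The function name mirrors the `use_of_ip` column used in the training step.

```python
import ipaddress
from urllib.parse import urlsplit


def use_of_ip(url):
    """Return 1 if the URL's host is a literal IP address, else 0."""
    # urlsplit only recognises the host when a scheme is present
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlsplit(url).hostname or ""
        ipaddress.ip_address(host)  # raises ValueError for domain names
        return 1
    except ValueError:
        return 0
```

This sketch only catches addresses in standard notation; hosts written in obfuscated forms such as hexadecimal would need extra handling.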

HTTPS Presence
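
The code for this feature is not shown in the post. A minimal sketch, assuming the feature simply records whether the URL's scheme is `https`, is given below. The function name mirrors the `contains_https` column used in the training step; the encoding (1 = HTTPS present) is an assumption, since the post does not say which way the flag points.

```python
from urllib.parse import urlsplit


def contains_https(url):
    """Return 1 if the URL uses the https scheme, else 0.

    Assumed encoding: 1 means HTTPS is present. Only the scheme is
    checked, so 'https' appearing in the host or path does not count.
    """
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        return 0  # malformed URL: treat as not using HTTPS
    return 1 if scheme == "https" else 0
```

Checking the parsed scheme instead of searching for the substring "https" matters because attackers sometimes place that token inside the domain or path to look trustworthy.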

Suspicious Words

import re

def suspicious_words(url):
    # 1 = the URL contains at least one of the trigger words, 0 = none found
    match = re.search(r'login|bank|account|paypal|secure|free', url)
    return 1 if match else 0

Redirects
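
The code for this feature is not shown in the post. The sketch below is one way to count the redirects a URL follows, using only the standard library. It assumes each hop is probed with a single HEAD request; the helper `http_next_hop`, the `next_hop` parameter, and the `max_hops` limit are all illustrative names, not part of the original project. The function name mirrors the `count_redirects` column used in the training step.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_next_hop(url, timeout=5):
    """Send one HEAD request; return the redirect target, or None."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("URL has no host")
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in REDIRECT_CODES and location:
            return urljoin(url, location)  # resolve relative targets
        return None
    finally:
        conn.close()


def count_redirects(url, next_hop=http_next_hop, max_hops=10):
    """Count redirects followed from url, stopping at max_hops or on a loop."""
    seen = {url}
    hops = 0
    while hops < max_hops:
        try:
            target = next_hop(url)
        except (OSError, http.client.HTTPException, ValueError):
            break  # unreachable host or malformed URL: stop counting
        if target is None or target in seen:
            break
        seen.add(target)
        url = target
        hops += 1
    return hops
```

Passing the hop resolver as a parameter lets the counting logic be exercised offline, for example with a dictionary that maps each URL to its redirect target: `count_redirects(start_url, next_hop=mapping.get)`.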

Step 3: Model Training

With these features extracted, we moved on to model training using Logistic Regression. This model is a simple yet effective classification technique that works well for binary classification problems.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Splitting data into training and testing sets
X = df[['domain_age', 'dns_record', 'shortening_service', 'use_of_ip', 'url_length', 'contains_https', 'suspicious_words', 'count_redirects']]
y = df['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

Step 4: Evaluating the Model

We evaluated the model’s performance on the test data using accuracy and a classification report, which helped us understand its predictive power.

from sklearn.metrics import accuracy_score, classification_report

# Making predictions on the test set
y_pred = model.predict(X_test)

# Evaluating model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Conclusion

This phishing URL detection project allowed me to delve into the exciting field of cybersecurity and machine learning. By analyzing features such as domain age, DNS records, and suspicious URL patterns, we built an effective model capable of detecting phishing URLs with high accuracy. The experience was both challenging and rewarding, reinforcing the importance of feature engineering in building reliable machine learning models.

Stay tuned for more details!