In today's digital age, cyber security is more critical than ever. One of the most prevalent threats is phishing, where attackers deceive individuals into revealing sensitive information such as passwords, banking details, or personal identification by posing as legitimate entities. A primary tool used by these attackers is phishing URLs, which lead unsuspecting users to malicious websites. Detecting these URLs before they cause harm is essential in protecting individuals and organisations alike.
Overview
Phishing URL detection using machine learning has emerged as an effective solution to this problem. Unlike traditional methods, such as blacklisting URLs, machine learning algorithms can dynamically analyze patterns in URLs and identify potential phishing attempts even before they’ve been reported.
During my internship at CRAC Learning, I worked on a critical cyber security project: Phishing URL Detection. The goal was to train a machine learning model that could accurately identify phishing URLs by extracting meaningful patterns from them. This blog provides a technical overview of the project and showcases some of the most important features used to build a successful detection model.
Feature Extraction
When detecting phishing URLs, each feature plays a crucial role in distinguishing legitimate websites from malicious ones. Here’s a breakdown of the importance and need for each feature:
1. Domain Age
Importance: Phishing websites are often short-lived, as attackers use them for quick, targeted attacks before they get blacklisted or shut down. By looking at the domain age, we can detect new or recently created domains, which are suspicious.
Need: Legitimate businesses usually have well-established domains that have been registered for several years. If a domain is younger than a specific threshold (e.g., 40 months), it is flagged as suspicious because it might indicate a newly registered phishing site.
2. DNS Record
Importance: DNS (Domain Name System) records map a domain name to the servers that host it, so an active record is a basic signal that a domain is properly set up and maintained. Phishing URLs often lack valid DNS records because they are quickly set up and discarded once their purpose is served.
Need: A valid and active DNS record typically indicates a legitimate website, while an invalid or missing DNS record suggests that the URL might be associated with a phishing attack.
3. Shortened URL
Importance: URL shortening services (like bit[.]ly or goo[.]gl) are frequently used by attackers to obscure malicious URLs. This hides the true destination of the URL, making it harder for users to recognize phishing attempts.
Need: Detecting shortened URLs helps uncover phishing attempts masked behind legitimate-looking short links, providing an additional layer of protection.
4. Use of IP Address in URL
Importance: Legitimate websites generally use domain names, whereas phishing websites sometimes use raw IP addresses to avoid detection or bypass domain-related checks.
Need: The presence of an IP address in a URL is highly suspicious and often indicates a phishing attempt, especially if combined with other red flags.
5. URL Length
Importance: Phishing URLs are often longer than legitimate URLs. Attackers might add numerous subdomains or parameters to deceive users into thinking the URL is complex and legitimate.
Need: A URL length exceeding a certain threshold (e.g., 60 characters) raises suspicion, as longer URLs can be used to hide malicious content.
6. HTTPS Presence
Importance: While HTTPS is a sign of security, its absence can indicate a lack of effort to secure the website, which is common in phishing sites.
Need: Websites without HTTPS are more likely to be phishing sites since legitimate websites increasingly use HTTPS to encrypt data. However, this feature needs to be used in conjunction with others, as some phishing websites may also use HTTPS.
7. Suspicious Words in URL
Importance: Attackers often use familiar words like “login,” “bank,” “account,” “secure,” or “verify” to trick users into believing they are accessing a trusted site.
Need: Identifying URLs with suspicious words can help flag sites trying to impersonate financial institutions, e-commerce platforms, or email providers—common targets for phishing attacks.
8. Redirects
Importance: Phishing websites frequently use multiple redirects to confuse users or evade detection by security systems.
Need: Tracking the number of redirects a URL follows is crucial, as legitimate websites typically don’t use many redirects. A high number of redirects can indicate an attempt to obfuscate the final destination, which could be a phishing site.
Each of these features, when analyzed together, significantly enhances the ability to detect phishing URLs. By understanding and incorporating these characteristics into a machine learning model, we can create robust systems that protect users from falling victim to phishing attacks.
Implementation
Step 1: Load required libraries and import URL dataset
Step 2: Feature Engineering
To build a robust model, we engineered certain key features based on patterns observed in phishing URLs:
Domain Age
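from whois import whois  # assumed to come from the python-whois package

def domain_age(domain):
    """Return 1 (suspicious) if the registration span is under 40 months, else 0."""
    try:
        whois_info = whois(domain)
        creation_date = whois_info.creation_date
        expiration_date = whois_info.expiration_date
        # Span between creation and expiration, in days
        age_of_domain = abs((expiration_date - creation_date).days)
        return 1 if (age_of_domain / 30) < 40 else 0
    except Exception:
        # A failed or incomplete WHOIS lookup is treated as suspicious
        return 1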
DNS Record
def dns_record_check(dns_record):
    # 1 = no active DNS record (suspicious), 0 = active record
    return 1 if dns_record == 'Inactive' else 0
Shortened URL
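A minimal sketch of this check, assuming a small hand-picked list of shortener domains. The set `SHORTENER_DOMAINS` is illustrative and far from exhaustive, and the function name `shortening_service` is chosen only to match the column used in Step 3:

```python
from urllib.parse import urlparse

# Assumed, non-exhaustive list of URL shortening services
SHORTENER_DOMAINS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly", "is.gd"}

def shortening_service(url):
    """Return 1 if the URL's host is a known shortener, else 0."""
    # urlparse only recognises the host when a scheme is present
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENER_DOMAINS else 0
```

Matching on the parsed host rather than searching the whole string avoids flagging URLs that merely mention a shortener in their path.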
URL Length
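A minimal sketch of this feature, using the 60-character threshold mentioned earlier. The function name `url_length` and the `threshold` parameter are assumptions made for illustration:

```python
def url_length(url, threshold=60):
    """Return 1 if the URL is longer than `threshold` characters, else 0."""
    return 1 if len(url) > threshold else 0
```

Keeping the threshold as a parameter makes it easy to tune against the dataset instead of hard-coding a single cut-off.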
Use of IP Address
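One way this check could look, assuming only the host part of the URL should be tested. The function name `use_of_ip` mirrors the column used in Step 3; obfuscated forms such as hexadecimal or decimal-encoded addresses are not covered by this sketch:

```python
import ipaddress
from urllib.parse import urlparse

def use_of_ip(url):
    """Return 1 if the URL's host is a literal IPv4 or IPv6 address, else 0."""
    # urlparse only recognises the host when a scheme is present
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlparse(url).hostname or ""
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
        return 1
    except ValueError:
        return 0
```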
HTTPS Presence
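A minimal sketch of this feature, assuming the `contains_https` column is encoded as 1 when the URL uses the `https` scheme and 0 otherwise (the encoding is an assumption, not confirmed by the dataset):

```python
from urllib.parse import urlparse

def contains_https(url):
    """Return 1 if the URL's scheme is https, else 0."""
    # Checking the parsed scheme avoids matching 'https' elsewhere in the URL
    return 1 if urlparse(url).scheme == "https" else 0
```

Checking the scheme rather than searching for the substring matters because attackers sometimes place "https" in the hostname or path of a plain HTTP link.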
Suspicious Words
import re

def suspicious_words(url):
    # Flag URLs containing terms commonly used to imitate trusted services
    match = re.search(r'login|bank|account|paypal|secure|free', url)
    return 1 if match else 0
Redirects
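Counting the redirects a URL actually follows requires sending HTTP requests and inspecting the response chain. As an offline stand-in, the sketch below counts `//` sequences that appear after the scheme, which often signal an embedded second URL. This static proxy and the function name `count_redirects` are assumptions, not necessarily how the feature was computed in the project:

```python
import re

def count_redirects(url):
    """Count '//' sequences that appear after the leading scheme, if any."""
    # Strip a leading 'scheme://' so it is not counted
    remainder = re.sub(r'^[a-zA-Z][a-zA-Z0-9+.-]*://', '', url)
    return remainder.count("//")
```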
Step 3: Model Training
With these features extracted, we moved on to model training using Logistic Regression. This model is a simple yet effective classification technique that works well for binary classification problems.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Splitting data into training and testing sets
X = df[['domain_age', 'dns_record', 'shortening_service', 'use_of_ip', 'url_length', 'contains_https', 'suspicious_words', 'count_redirects']]
y = df['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
Step 4: Evaluating the Model
We evaluated the model’s performance on the test data using accuracy and a classification report, which helped us understand its predictive power.
from sklearn.metrics import accuracy_score, classification_report

# Making predictions on the test set
y_pred = model.predict(X_test)

# Evaluating model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
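To make the reported numbers easier to interpret, here is a small standard-library sketch of how accuracy, precision and recall are derived from the predictions. It assumes phishing is labelled 1 in `status`, and the helper name `binary_metrics` is hypothetical:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of URLs flagged as phishing that really are phishing
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of real phishing URLs that the model caught
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

For phishing detection, recall on the phishing class is usually worth watching closely, since a missed phishing URL tends to cost more than a false alarm.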
Conclusion
This phishing URL detection project allowed me to delve into the exciting field of cybersecurity and machine learning. By analyzing features such as domain age, DNS records, and suspicious URL patterns, we built an effective model capable of detecting phishing URLs with high accuracy. The experience was both challenging and rewarding, reinforcing the importance of feature engineering in building reliable machine learning models.
Stay Tuned for more details!