Original scientific paper

A Lightweight Approach to Spam Link Identification Using Automated Domain Reputation Scoring

Davor Cafuta, Danijela Pongrac, Brigitta Cafuta, Ivica Dodig

Abstract

This paper examines the extraction and reputation analysis of domains found in an email dataset containing both spam and ham messages. The Enron corpus was processed in a virtualised Ubuntu environment (VMware) for security. A custom Python application extracted domains from the message bodies and stored the structured output in .parquet format for subsequent analysis on the host system. The reputation of each identified domain was assessed using the Open PageRank API. The methodology includes statistical analysis, such as calculating average scores and examining the distribution of domains across a categorised ranking scale (1–10). Experimental results reveal a significant disparity between the two categories, with spam-associated domains exhibiting a notably lower average reputation. The study outlines the technical implementation and technology stack, and evaluates the benefits and limitations of this approach, highlighting its potential application in improving automated spam filtering systems.
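The pipeline summarised above can be illustrated with a minimal Python sketch. It is not the authors' application: the regular expression, the function names `extract_domains` and `summarise`, the example messages, and the hard-coded `scores` table (standing in for values that would come from the Open PageRank API) are all assumptions made for illustration, as is the choice to bin scores by their integer part.

```python
import re
from collections import Counter
from statistics import mean
from urllib.parse import urlparse

# Assumed pattern: treat any http(s) URL in the body as a link candidate.
URL_RE = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_domains(body):
    """Return the set of host names found in URLs inside a message body."""
    domains = set()
    for url in URL_RE.findall(body):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Malformed URLs (e.g. broken IPv6 brackets) are skipped.
            continue
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        domains.add(host)
    return domains


def summarise(domains, scores):
    """Average score and per-bin counts for the domains that have a score."""
    known = [scores[d] for d in domains if d in scores]
    average = mean(known) if known else None
    # Assumption: bin each score by its integer part.
    histogram = Counter(int(s) for s in known)
    return average, histogram


# Hypothetical messages and reputation scores, for illustration only.
messages = [
    ("spam", "Buy now http://www.cheap-pills.example/offer or https://win.example/x"),
    ("ham", "The minutes are at https://www.example.org/minutes."),
]
scores = {"cheap-pills.example": 1.2, "win.example": 0.0, "example.org": 8.4}

# Collect the distinct domains per category, then summarise each category.
by_label = {}
for label, body in messages:
    by_label.setdefault(label, set()).update(extract_domains(body))

report = {label: summarise(doms, scores) for label, doms in by_label.items()}
```

In this sketch `report` maps each category to its average score and score histogram, which mirrors the per-category comparison described in the abstract; a real run would replace `scores` with API results and read the messages from the stored .parquet file.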

Keywords

Open PageRank, Enron dataset, Domain Reputation