The growing volume of malware circulating daily, combined with its increasing structural diversity, presents substantial challenges for automated malware analysis. Machine-learning classifiers, widely adopted for malware detection and family classification, often struggle to generalize when confronted with previously unseen malware families, or with datasets affected by labeling inconsistencies, class imbalance, and incomplete feature sets.
Beyond classification, the structural variability observed within malware families complicates similarity measurement. Malware authors actively use techniques such as packing and deliberate binary modifications to generate diverse variants from the same code base.
In addition to these challenges, many malware datasets collected from live feeds contain truncated samples—files that are incomplete due to errors during collection or transmission. While truncation is not a source of meaningful diversity, it introduces noise that pollutes datasets and wastes analysis resources when such files are submitted to analysis tools or sandboxes.
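One simple way to flag such truncated samples, sketched below under the assumption of well-formed PE headers, is to compare a file's actual size against the minimum size implied by its section table (the farthest raw-data extent of any section). The function names are illustrative, not from the thesis.

```python
import struct

def expected_min_size(data: bytes) -> int:
    """Minimum file size implied by the PE section table.

    Illustrative sketch only: assumes the DOS header, PE signature,
    and section headers are themselves intact.
    """
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature.
    pe_off, = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    num_sections, = struct.unpack_from("<H", data, pe_off + 6)
    opt_size, = struct.unpack_from("<H", data, pe_off + 20)
    sect_off = pe_off + 24 + opt_size  # first section header (40 bytes each)
    end = sect_off + num_sections * 40
    for i in range(num_sections):
        entry = sect_off + i * 40
        # SizeOfRawData at +16, PointerToRawData at +20 within the header.
        size_raw, ptr_raw = struct.unpack_from("<II", data, entry + 16)
        end = max(end, ptr_raw + size_raw)
    return end

def looks_truncated(data: bytes) -> bool:
    """True if the file is shorter than its section table claims."""
    return len(data) < expected_min_size(data)
```

A more robust filter would also cross-check the overlay, data directories, and certificate table, but even this coarse check removes many collection-error artifacts before they reach a sandbox.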
At the same time, unrelated malware samples often display misleading structural similarities due to common build environments, shared packers, and recurring compiler toolchains. These inter-family artifacts undermine the precision of static similarity features, leading to clustering errors and incorrect associations between distinct malware families.
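The effect of such shared artifacts on static similarity can be illustrated with a toy example: two hypothetical samples from different families whose import tables are dominated by the same packer-stub APIs score high under a Jaccard similarity, even though their family-specific code differs entirely. The import sets below are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (here, imported API names)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical packer stub: APIs typically imported by the unpacking loader,
# shared by every sample packed with the same tool regardless of family.
packer_stub = {"LoadLibraryA", "GetProcAddress", "VirtualAlloc", "VirtualProtect"}

# Two unrelated families, each adding only one family-specific import.
family_a = packer_stub | {"InternetOpenA"}   # e.g. a downloader
family_b = packer_stub | {"RegSetValueExA"}  # e.g. a registry persister

sim = jaccard(family_a, family_b)  # 4 shared / 6 total = ~0.67
```

Despite belonging to distinct families, the two samples appear two-thirds similar, which is exactly the kind of inter-family artifact that degrades clustering precision.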
This thesis addresses these challenges through a measurement-driven investigation of malware diversity across three perspectives: the impact of dataset composition and feature selection on machine-learning classifiers, the extent and nature of intra-family polymorphism in malware binaries, and the structural factors driving false similarities between unrelated families. Together, these studies provide a comprehensive empirical foundation for improving the design, evaluation, and reliability of malware classification and similarity analysis techniques.