Overview
Explore an open-source malware classifier and dataset in this conference talk from BSidesSF 2018. Delve into how the scarcity of public datasets holds back machine learning for static malware detection. Learn about a new open-source dataset of labels for diverse Windows PE files, which includes feature vectors for model building and a pre-trained model for research. Discover the reasoning behind feature selection and labeling, and see how the model performs on real-world samples. Gain insights into the Ember dataset, its naming convention, and the composition of its training set. Examine two types of features, how they are calculated, and feature categories such as section information, strings, and file size. Understand feature vectorization, model training, and model scoring. Explore the code base, Python notebook, and feature engineering techniques. Investigate semisupervised learning and offensive research applications. Conclude with a live demonstration covering data download, analysis of packed samples, and metadata examination.
Syllabus
Introduction
Why Open Datasets
MNIST
Security Lacks Datasets
Malware Classification
Ember
The Name
The Dataset
The Training Set
The Data
Two Types of Features
Calculating Features
Categories of Features
Section Information
Strings
File Size
Feature Vectorization
Training a Model
Scoring the Model
Disclaimer
Code Base
Python Notebook
Feature Engineering
Semisupervised Learning
Offensive Research
Demo Time
Hat
Download Data
Packed Samples
Metadata
Taught by
Security BSides San Francisco