Deep-pwning is a lightweight framework for experimenting with machine learning models with the goal of evaluating their robustness against a motivated adversary.
Note that deep-pwning in its current state is nowhere close to maturity or completion. It is meant to be experimented with, expanded upon, and extended by you. Only then can it truly become the go-to penetration testing toolkit for statistical machine learning models.
This tool was released at DEF CON 24 in Las Vegas, August 2016, during a talk titled Machine Duping 101: Pwning Deep Learning Systems.
All of the included examples and code implement deep neural networks, but they can be used to generate adversarial images for similarly tasked classifiers that are not implemented with deep neural networks. This is because of the phenomenon of 'transferability' in machine learning, which Papernot et al. expounded upon in this paper. Transferability means that adversarial samples crafted with a DNN model A may be able to fool another, distinctly structured DNN model B, as well as some other SVM model C.
This figure, taken from the aforementioned paper (Papernot et al.), shows the percentage of successful adversarial misclassifications for a source model (used to generate the adversarial sample) on a target model (upon which the adversarial sample is tested).
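To make the source/target idea concrete, here is a minimal sketch of how such a success-rate matrix could be computed. It is not code from deep-pwning or from the paper: the two "models" are toy linear classifiers with made-up weights, the samples and epsilon are hand-picked, and the function names (predict, craft, success_rate) are hypothetical. The perturbation is a simple gradient-sign step, used here only as one familiar way to craft adversarial samples.

```python
# Toy transferability check with two hand-written linear classifiers.
# All weights, samples, and epsilon below are illustrative assumptions.

def predict(model, x):
    """Return class 1 if the linear score is positive, else class 0."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def craft(source, x, label, eps):
    """Gradient-sign step against the source model.

    For a linear model the gradient of the score w.r.t. x is the weight
    vector, so stepping against sign(w) lowers the score (pushing a
    class-1 sample toward class 0) and stepping along it raises the score.
    """
    weights, _ = source
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(w) for w, xi in zip(weights, x)]


def success_rate(source, target, samples, eps):
    """Percentage of samples crafted on `source` that `target` misclassifies."""
    fooled = total = 0
    for x, label in samples:
        if predict(target, x) != label:
            continue  # only count samples the target gets right to begin with
        total += 1
        if predict(target, craft(source, x, label, eps)) != label:
            fooled += 1
    return 100.0 * fooled / total if total else 0.0


models = {"A": ([1.0, 0.5], 0.0), "B": ([1.0, -0.2], 0.0)}
samples = [
    ([0.3, 0.2], 1),
    ([0.9, 0.8], 1),
    ([-0.3, -0.1], 0),
    ([-0.9, -0.8], 0),
]

matrix = {
    (s, t): success_rate(models[s], models[t], samples, eps=0.4)
    for s in models
    for t in models
}
for (s, t), rate in sorted(matrix.items()):
    print(f"source {s} -> target {t}: {rate:.0f}% misclassified")
```

With these made-up values, samples crafted on A fool B half of the time, while samples crafted on B never fool A, which shows that transfer rates need not be symmetric between a source and a target model.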
Deep-pwning is modularized into several components to minimize code repetition. Because of the vastly different nature of potential classification tasks, the current iteration of the code is optimized for classifying images and phrases (using word vectors).
These are the code modules that make up the current iteration of Deep-pwning:
- Drivers: The drivers are the main execution point of the code. This is where you can tie the different modules and components together, and where you can inject more customizations into the adversarial generation processes.
- Models: This is where the actual machine learning model implementations are located. For example, the provided lenet5 model definition is located in the model() function within lenet5.py. It defines the network as the following:
- Adversarial (advgen): This module contains the code that generates adversarial output for the models. The run() function defined in each of these advgen classes takes an input_dict that contains several predefined tensor operations for the machine learning model defined in TensorFlow. If the model that you are generating the adversarial sample for is known, the variables in the input dict should be based on that model definition. Otherwise, if the model is unknown (black-box generation), a substitute model should be used or implemented, and that substitute's definition should be used instead. The variables that need to be passed in are the input tensor placeholder variables and labels (often referred to as x -> input and y_ -> labels), the model output (often referred to as y_conv), and the actual test data and labels that the adversarial images will be based on.
- Config: Application configurations.
- Utils: Miscellaneous utilities that don’t belong anywhere else. These include helper functions to read data, deal with TensorFlow queue inputs, etc.
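The shape of an advgen-style run(input_dict) call can be sketched without TensorFlow. The following is only an illustration under stated assumptions: plain Python callables stand in for the tensor operations, the dictionary keys (y_conv, loss_gradient, test_data, test_labels) and the class names ToyLinearModel and ToyFastGradientSign are hypothetical, and the fast gradient sign method is used as one well-known generation technique rather than as a statement of what the advgen classes implement.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ToyLinearModel:
    """Stand-in for a known (white-box) model: logistic regression."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def y_conv(self, x):
        # Model output: probability that x belongs to class 1.
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return sigmoid(z)

    def loss_gradient(self, x, label):
        # Gradient of the cross-entropy loss w.r.t. the input: (p - y) * w.
        p = self.y_conv(x)
        return [(p - label) * w for w in self.weights]


class ToyFastGradientSign:
    """Hypothetical advgen-like class exposing run(input_dict)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon

    def run(self, input_dict):
        y_conv = input_dict["y_conv"]
        loss_gradient = input_dict["loss_gradient"]
        results = []
        for x, label in zip(input_dict["test_data"], input_dict["test_labels"]):
            grad = loss_gradient(x, label)
            # Step each feature by epsilon in the direction that raises the loss.
            x_adv = [
                xi + self.epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)
            ]
            results.append({
                "adversarial": x_adv,
                "original_class": int(y_conv(x) > 0.5),
                "adversarial_class": int(y_conv(x_adv) > 0.5),
            })
        return results


# Made-up weights and test data, chosen so both samples sit near the boundary.
model = ToyLinearModel(weights=[2.0, -1.0], bias=0.0)
input_dict = {
    "y_conv": model.y_conv,
    "loss_gradient": model.loss_gradient,
    "test_data": [[0.2, 0.1], [-0.3, 0.2]],
    "test_labels": [1, 0],
}
results = ToyFastGradientSign(epsilon=0.3).run(input_dict)
for r in results:
    print(r["original_class"], "->", r["adversarial_class"], r["adversarial"])
```

In a black-box setting, the same dictionary would be built from a substitute model's output and gradient instead of the target's, since the target's internals are not available.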
These are the resource directories relevant to the application:
- Checkpoints: TensorFlow allows you to load a partially trained model to resume training, or load a fully trained model into the application for evaluation or performing other operations. All these saved 'checkpoints' are stored in this resource directory.
- Data: This directory stores all the input data in whatever format the driver application takes in.
- Output: This is the output directory for all application output, including adversarial images that are generated.
git clone https://github.com/cchio/deep-pwning.git
pip install -r requirements.txt
Execution Example (with the MNIST driver)
To restore from a previously trained checkpoint. (configuration in config/mnist.conf)
To train from scratch. (note that any previous checkpoint(s) located in the folder specified in the configuration will be overwritten)
Copyright (c) 2016 Clarence Chio