DroidDetective: machine learning malware analysis framework for Android apps

DroidDetective

DroidDetective is a Python tool for analysing Android applications (APKs) for potential malware-related behaviour and configurations. When provided with a path to an application (APK file), DroidDetective uses its ML model to predict whether the application is malicious. Features and qualities of DroidDetective include:

  • Analysing which of ~330 permissions are specified in the application’s AndroidManifest.xml file. 🙅
  • Analysing the number of standard and proprietary permissions in use in the application’s AndroidManifest.xml file. 🧮
  • Using a Random Forest machine learning classifier, trained on the above data from ~14 malware families and ~100 Google Play Store applications. 💻

Data Science | The ML Model

DroidDetective works by training a Random Forest classifier on information derived from both known malware APKs and standard APKs available on the Google Play Store. The tool comes pre-trained; however, the model can be re-trained on a new dataset at any time. ⚙️

This model currently uses permissions from an APK's AndroidManifest.xml file as its feature set. It creates a dictionary with one entry per standard Android permission and sets that feature to 1 if the permission is present in the APK (and 0 otherwise). Two further features record the number of permissions in use in the manifest and the number of unidentified permissions found in the manifest.
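The feature extraction described above can be sketched roughly as follows. This is a minimal illustration, not DroidDetective's actual code: the function name `extract_features`, the feature keys `num_permissions` and `num_unidentified_permissions`, and the four-entry permission list are all hypothetical, and the sketch assumes the manifest has already been decoded from the APK's binary XML into plain text.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used for android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical, abbreviated list; the real tool tracks ~330 standard permissions.
STANDARD_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "android.permission.CAMERA",
]


def extract_features(manifest_xml: str) -> dict:
    """Build a permission-based feature dictionary from decoded manifest XML."""
    root = ET.fromstring(manifest_xml)
    # Collect the distinct permission names requested via <uses-permission>.
    requested = {
        elem.get(ANDROID_NS + "name") for elem in root.iter("uses-permission")
    }
    requested.discard(None)  # ignore elements without an android:name attribute

    # One binary feature per standard permission: 1 if present, else 0.
    features = {perm: int(perm in requested) for perm in STANDARD_PERMISSIONS}
    # Two count features (assumed names): total and non-standard permissions.
    features["num_permissions"] = len(requested)
    features["num_unidentified_permissions"] = len(
        requested - set(STANDARD_PERMISSIONS)
    )
    return features


SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.SEND_SMS" />
    <uses-permission android:name="com.example.app.CUSTOM_PERMISSION" />
</manifest>"""

features = extract_features(SAMPLE_MANIFEST)
print(features)
```

Here duplicate `<uses-permission>` entries are counted once, which is a design choice of this sketch; the real tool may count them differently.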

The pre-trained model was trained on approximately 14 malware families (each with one or more APK files), sourced from ashisdb’s repository, and approximately 100 normal applications from the Google Play Store.
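For readers who want to re-train on their own data, the general shape of the training step might look like the sketch below. It assumes scikit-learn's `RandomForestClassifier`; the source does not say which library or hyperparameters DroidDetective uses, and the feature rows, labels, and feature names here are invented toy values rather than the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature layout for this sketch only.
FEATURE_NAMES = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "num_permissions",
    "num_unidentified_permissions",
]

# Invented toy rows: one row per APK, label 1 = malware, 0 = benign.
X_train = [
    [1, 1, 1, 9, 2],
    [1, 0, 1, 7, 3],
    [1, 1, 0, 8, 1],
    [1, 0, 0, 2, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 3, 0],
]
y_train = [1, 1, 1, 0, 0, 0]

# Fixed random_state so repeated runs build the same forest.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify a new, unseen feature row.
prediction = model.predict([[1, 1, 1, 10, 2]])[0]
print("malicious" if prediction == 1 else "benign")

# Rank features by the forest's impurity-based importance scores.
ranked = sorted(
    zip(FEATURE_NAMES, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, weight in ranked:
    print(f"{name}: {weight:.3f}")
```

With a dataset as small as the one described (roughly 14 malware families and 100 benign apps), results can vary noticeably between train/test splits, so fixing the random seed makes comparisons between re-training runs easier.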

The statistics for this ML model are as follows:

Accuracy: 0.9310344827586207
Recall: 0.9166666666666666
Precision: 0.9166666666666666
F-Measure: 0.9166666666666666
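As a reminder of how these four statistics relate to one another, the sketch below computes them from confusion-matrix counts. The source does not state the size or make-up of the test split; the counts used here (11 true positives, 1 false positive, 1 false negative, 16 true negatives) are an assumption, chosen only because they are arithmetically consistent with the reported values (27/29 for accuracy, 11/12 for the others). The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, recall, precision and F-measure from raw counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,             # correct / all predictions
        "recall": tp / (tp + fn),                  # malware found / all malware
        "precision": tp / (tp + fp),               # true malware / flagged apps
        "f_measure": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of the two
    }


# Assumed counts; not taken from the source.
metrics = classification_metrics(tp=11, fp=1, fn=1, tn=16)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the model would have mislabelled one malicious app as benign and one benign app as malicious out of 29 test samples.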

 

The top 10 highest weighted features (i.e. Android permissions) used by this model, for identifying malware, can be seen below:

Copyright (C) 2022 James Stevenson