polypyus: locate functions in raw binaries
Polypyus Firmware Historian
Polypyus learns to locate functions in raw binaries by extracting known functions from similar binaries. Thus, it is a firmware historian. Polypyus works without disassembling these binaries, which is an advantage for binaries that are complex to disassemble and where common tools miss functions. In addition, the binary-only approach makes it very fast: it runs within a few seconds. However, this approach requires the binaries to target the same architecture and to be built with similar compiler options.
Polypyus integrates into the workflow of existing tools like Ghidra, IDA, BinDiff, and Diaphora. For example, it can import previously annotated functions and learn from them, and it can export the functions it finds for import into IDA. Since Polypyus uses rather strict thresholds, it only found correct matches in our experiments. While this leads to fewer results than with existing tools, it is a good entry point: load these matches into IDA to improve its auto analysis results, then run BinDiff on top.
What Polypyus solves
When working on raw firmware binaries, namely various Broadcom and Cypress Bluetooth firmware versions, we found that IDA auto analysis often identified function starts incorrectly. In IDA Pro 6.8 the auto analysis is a bit more aggressive, leading to more results but also more false positives. IDA Pro 7.2 was more conservative overall but missed a lot of functions. This led to only a few BinDiff matches between our firmware versions in IDA Pro 6.8 and no useful matches at all in IDA Pro 7.2.
Interestingly, BinDiff often failed to identify functions that, except for branches, were byte-identical. Note that Polypyus searches for exactly these byte-identical functions. We assume that BinDiff fails at these functions because missing functions and false positives produce a different call graph. Sometimes these functions were already recognized by IDA, but often IDA either did not recognize them as code or did not mark them as functions. Note that Diaphora has similar problems, as it exports the functions identified by IDA before processing them further.
Moreover, while we found that Amnesia finds many functions, it also finds many false positives. However, many functions have a similar stack frame setup at the beginning. Thus, Polypyus has an option to learn common function starts from the annotated input binaries and apply them to other binaries to identify functions without matching their names. This optional step is only applied to regions in which no functions were previously located. This way, the common function starts method and the main function finding do not conflict.
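The common function starts idea can be illustrated with a minimal sketch. This is not the Polypyus implementation: the function names, the 4-byte prologue width, the 2-byte scan step (a typical Thumb alignment), and the minimum count are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def learn_prologues(binary: bytes, starts: Iterable[int],
                    width: int = 4, min_count: int = 2) -> Set[bytes]:
    """Collect the first `width` bytes of each annotated function and
    keep those byte sequences that occur at least `min_count` times."""
    counts = Counter(binary[s:s + width] for s in starts
                     if s + width <= len(binary))
    return {prefix for prefix, n in counts.items() if n >= min_count}


def scan_gaps(binary: bytes, prologues: Set[bytes],
              known: List[Tuple[int, int]],
              width: int = 4, step: int = 2) -> List[int]:
    """Return offsets that start with a learned prologue and do not
    overlap any already located function given as (start, size)."""
    covered = bytearray(len(binary))
    for start, size in known:
        for i in range(start, min(start + size, len(binary))):
            covered[i] = 1
    hits = []
    for off in range(0, len(binary) - width + 1, step):
        window = binary[off:off + width]
        if not any(covered[off:off + width]) and window in prologues:
            hits.append(off)
    return hits


# Hypothetical example data: P appears twice as a function start, Q once.
P = bytes.fromhex("2de9f041")
Q = bytes.fromhex("10b50446")
annotated = P + bytes(4) + P + bytes(4) + Q + bytes(4)
learned = learn_prologues(annotated, starts=[0, 8, 16])

# In the target, the function at offset 4 (size 8) is already located,
# so only the gap after it is searched.
target = bytes(4) + P + bytes(4) + P + bytes(4)
candidates = scan_gaps(target, learned, known=[(4, 8)])
```

Restricting the scan to uncovered regions mirrors the rule stated above: candidates found this way never overlap functions that were already located by name.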
How it works
Polypyus creates fuzzy binary matchers by comparing common functions in a collection of annotated firmware binaries.
Currently, the following annotations are supported:
- A WICED Studio patch.elf file, which is a special ELF file containing only symbol definitions.
- A .symdefs file as it is produced by most ARM compilers.
- A .csv file with a format documented in the examples.
These annotations contain the address, size, and name of known functions. The more the input binaries in the history collection have in common, the better Polypyus performs and the better its results are. Given several slightly different variants of a function, Polypyus creates very good matchers.
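As a rough illustration of what a fuzzy binary matcher can look like, the sketch below merges several byte-level variants of one function into a pattern with wildcards and searches a target binary for it. It is a simplified assumption, not the actual Polypyus algorithm: it only handles variants of equal size, and the names `build_matcher` and `find_matches`, the scan step, and the wildcard threshold are hypothetical.

```python
from typing import List, Optional, Sequence

Pattern = List[Optional[int]]  # None marks a wildcard byte


def build_matcher(samples: Sequence[bytes]) -> Pattern:
    """Keep a byte where all samples agree, use a wildcard elsewhere."""
    if not samples:
        raise ValueError("need at least one sample")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("this sketch only handles samples of equal size")
    return [col[0] if len(set(col)) == 1 else None
            for col in zip(*samples)]


def find_matches(pattern: Pattern, binary: bytes, step: int = 2,
                 max_wildcard_ratio: float = 0.5) -> List[int]:
    """Return offsets where every fixed byte of the pattern matches.
    Patterns with too many wildcards are rejected to stay strict."""
    fixed = [(i, b) for i, b in enumerate(pattern) if b is not None]
    if not pattern or len(fixed) < len(pattern) * (1 - max_wildcard_ratio):
        return []
    hits = []
    for off in range(0, len(binary) - len(pattern) + 1, step):
        if all(binary[off + i] == b for i, b in fixed):
            hits.append(off)
    return hits


# Two hypothetical variants of one function that differ in a single
# byte of a branch target.
v1 = bytes.fromhex("10b5044600f012f810bd")
v2 = bytes.fromhex("10b5044600f034f810bd")
pattern = build_matcher([v1, v2])

# A third firmware contains the function with yet another branch target.
target = bytes(6) + bytes.fromhex("10b5044600f056f810bd") + b"\xff\xff"
offsets = find_matches(pattern, target)
```

The more variants agree outside of the branch bytes, the fewer wildcards the pattern needs, which is one way to read the remark that more commonalities in the history collection give better results.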
Copyright (C) 2020 seemoo-lab