Let's say we have a binary that we have either reverse engineered or have the associated debugging information. We will refer to this executable as binary S. We also have another executable called P that we know is similar to binary S in some way but it is stripped and has no debugging information. Binary P could have a similar source code to binary S or could be made from the same statically linked library. However, S and P are sufficiently different that they cannot be diffed with BinDiff.
We want to use information from binary S to transpose symbols from S to P.
This blog post will explain how the AI we are developing at RevEng.AI does exactly that. We will look at one technique that explores our program comprehension embeddings API to locate symbols in P given a single known anchor point in it.
As a running example, S will be the binary miredo from the Debian Sid package miredo/amd64. Miredo is a Teredo client (RFC 4380) that tunnels IPv6 packets in IPv4 UDP packets. We don't have access to the source code but we can access symbol information via the miredo-dbgsym/amd64 package. It is important to note that RFC 4380 uses a combination of MD5 and HMAC in its protocol design. You can download binary S below.
miredoThe miredo binary from the miredo/amd64 Debian Sid package. With debug_info.miredo162 KB.a{fill:none;stroke:currentColor;stroke-linecap:round;stroke-linejoin:round;stroke-width:1.5px;}download-circle
As for executable P, we will be using the binary fdupes from the Debian Sid package fdupes/amd64. FDupes is a program that identifies duplicate files within directories using MD5 hashes and byte by byte comparisons. You can download binary P below.
fdupesFDupes binary from the fdupes/amd64 Debian Sid package.fdupes27 KB.a{fill:none;stroke:currentColor;stroke-linecap:round;stroke-linejoin:round;stroke-width:1.5px;}download-circle
Our intuition informs us that, given both binaries contain code for processing MD5, can we port MD5 related symbols from S to P?
An inspection of our source binary S reveals the following symbols:
1root@desyl:~# readelf -s /usr/sbin/miredo | grep -i md5
2 81: 0000000000000000 0 FILE LOCAL DEFAULT ABS md5.c
3 82: 0000000000006360 2113 FUNC LOCAL DEFAULT 14 md5_process
4 138: 0000000000006e10 194 FUNC GLOBAL DEFAULT 14 md5_finish
5 369: 0000000000006be0 552 FUNC GLOBAL DEFAULT 14 md5_append
6 396: 0000000000006bb0 42 FUNC GLOBAL DEFAULT 14 md5_initA quick internet search for "md5" and "fdupes" reveals that FDupes has the same md5_process, md5_finish, md5_append, and md5_init functions in its source code (Github search).
md5_init is a small 42 byte function and should be easy to identify in Ghidra. For the rest of this blog post we are going to assume that we know the location of a single symbol in P - md5_init. This will be our known anchor point in P.
We are going to use this anchor point with the relative difference between symbols' embedded representations from S to locate another symbol in P. Before we explain exactly what that means or why we would want to do that, take a look at md5.c and you will see that md5_process is not a direct caller or callee of md5_init and therefore cannot easily be identified by simple static rules.
We will now explain how to locate md5_process in the stripped binary P using md5_init and the relative difference between md5_process and md5_init in our source executable S.
RevEng.AI's BinNet model uses a deep multi-output transformer architecture to capture the semantic meaning of binary code. It represents binary code (and all strings, XREFS, constants, etc.) as real-valued vectors that can be used for downstream ML tasks.
To understand this properly you need to have a basic understanding of Word2Vec and the concept of word representations. We recommend pausing here and reading the following blog post if you are unfamiliar with modern NLP techniques.
The famous example from Word2Vec allowed for the vector representations of the words king - man + woman ≈ queen. This proved that the vector space generated by Word2Vec captured semantic information beyond what was previously possible by Neural Network Language Models (NNLMs).
You might be able to guess where this is going...
We are going to use RevEng.AI's BinNet model and embeddings API to locate md5_process in P by calculating the difference between the embedded representations of md5_process - md5_init in S and adding the representation of md5_init in P.
We will use the following Python code:
1from reveng import BinNet, Binary, Symbol
2
3bn = BinNet(apikey='freeware')
4S = Binary(path='/usr/bin/miredo')
5P = Binary(path='/usr/bin/fdupes')
6
7# run binary analysis, P is stripped
8# extract function boundaries from ghidra
9S.analyse()
10P.analyse(FBD='ghidra', stripped=True)
11
12# store embeddings in numpy matrix, shape=(N, 192)
13# N is number of symbols in binary
14# 192 is the number of dimensions for BinNet embeddings
15Xs = bn.get_embeddings(S.symbols)
16Xp = bn.get_embeddings(P.symbols)
17
18# get embeddings for each symbol
19# md5_init is at 0x00003f07 in fdupes
20s_md5_init = Xs[S.get_symbol('md5_init'), :]
21s_md5_process = Xs[S.get_symbol('md5_process'), :]
22p_md5_init = Xp[S.get_symbol_at_loc(0x00003f07), :]
23
24# build query vector
25query_vec = s_md5_process - s_md5_init + p_md5_initOur query vector is a real-valued numpy vector of 32 bit floats:
1In [216]: query_vec
2Out[216]:
3array([[-1.0754741 , -0.53394 , 2.5343153 , -0.8589107 , -0.80127776,
4 3.6827471 , -0.2754942 , -0.25956583, -0.9386326 , -0.8809752 ,
5 -0.94144917, -0.9657012 , -0.37274447, 3.551879 , 1.4170911 ,
6 ...
7 -0.5281741 , -0.65251297, -0.9238823 , 1.2517035 , -0.77191585,
8 -0.9819512 , -0.695347 , -0.47379667, -0.6443882 , 0.8159219 ,
9 -0.20785426, -0.52132106, -0.7369828 , 2.757089 , -0.07254058,
10 -1.1361468 , -0.45225745]], dtype=float32)We now need to find the closest symbol in P to our query vector. We do this using the cosine similarity against all functions that Ghidra found in P.
1from sklearn.metrics.pairwise import cosine_similarity
2
3closest = cosine_similarity(Xp, query_vec).squeeze().argsort()[::-1][0]The closest BinNet embeddings in P to our query can then be accessed by looking up the row corresponding to the closest index:
1In [235]: Xp[closest, :]
2Out[235]:
3array([[-1.03495793, -0.60625399, 2.51171437, -0.84164309, -0.83068409,
4 3.69240599, -0.31704739, -0.22428862, -1.0581016 , -0.96066797,
5 -0.87574955, -0.95887101, -0.40816309, 3.62616827, 1.39601726,
6 ...
7 -0.46266309, -0.58537404, -0.87706089, 1.23309646, -0.75341943,
8 -0.94693962, -0.58251833, -0.45131244, -0.71748744, 0.77900041,
9 -0.25757586, -0.56068685, -0.74393312, 2.77102427, -0.04606016,
10 -1.16797348, -0.41193305]], dtype=float32)And now we can find the matched symbols virtual address:
1In [236]: P.symbols[closest].vaddr
2Out[236]: 14068The symbol md5_process should have a virtual address of 14068 (0x36f4) in our binary P. If we now download symbol information from the fdupes-dbgsym/amd64 Debian package and merge the ELF files we indeed find this is correct.
OK, BinDiff doesn't work but can't we just hash a function's disassembly code and find matches between the two functions? Can we use IDA FLIRT?
Let's download the debugging package and inspect P with symbols. The first thing to note is that md5_process is 2113 bytes in S and 2067 bytes in P. Matching hashes clearly won't work.
Is the disassembly similar? Let's take a look with Rizin:
1[0x000013b0]> pdf @ sym.md5_process
2┌ 2067: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
3│ 0x000036f4 4157 push r15
4│ 0x000036f6 4156 push r14
5│ 0x000036f8 4155 push r13
6│ 0x000036fa 4154 push r12
7│ 0x000036fc 55 push rbp
8│ 0x000036fd 53 push rbx
9│ 0x000036fe 8b4708 mov eax, dword [rdi + 8]
10│ 0x00003701 894424b8 mov dword [rsp - 0x48], eax
11│ 0x00003705 8b470c mov eax, dword [rdi + 0xc]
12│ 0x00003708 89442498 mov dword [rsp - 0x68], eax
13│ 0x0000370c 8b4710 mov eax, dword [rdi + 0x10]
14│ 0x0000370f 8944249c mov dword [rsp - 0x64], eax
15│ 0x00003713 8b4714 mov eax, dword [rdi + 0x14]
16│ 0x00003716 894424a0 mov dword [rsp - 0x60], eax
17│ 0x0000371a 4889f0 mov rax, rsi
18│ 0x0000371d 40f6c603 test sil, 3
19│ ┌─< 0x00003721 744c je 0x376f
20│ │ 0x00003723 488b06 mov rax, qword [rsi]
21│ │ 0x00003726 48894424c0 mov qword [rsp - 0x40], rax
22│ │ 0x0000372b 488b4608 mov rax, qword [rsi + 8] 1[0x00003560]> pdf @ sym.md5_process
2┌ 2113: sym.md5_process (int64_t arg1, int64_t arg2, int64_t arg_6fa87e4fh);
3| 0x00006360 4157 push r15
4│ 0x00006362 4156 push r14
5│ 0x00006364 4155 push r13
6│ 0x00006366 4154 push r12
7│ 0x00006368 55 push rbp
8│ 0x00006369 53 push rbx
9│ 0x0000636a 4881ec880000. sub rsp, 0x88
10│ 0x00006371 64488b042528. mov rax, qword fs:[0x28]
11│ 0x0000637a 4889442478 mov qword [var_78h], rax
12│ 0x0000637f 31c0 xor eax, eax
13│ 0x00006381 8b4708 mov eax, dword [rdi + 8]
14│ 0x00006384 89442428 mov dword [var_28h], eaxThe disassembly is significantly different and would not match with hashing or semantic hashing. IDA FLIRT works by matching the first 32 bytes of each function and would fail to find the match.
Why would we want to do this? RevEng.AI's BinNet embeddings aren't just useful for matching symbol names. They inherently capture the semantic meaning of binary code and could be used for:
There are far more efficient ways to use RevEng.AI's API to identify symbols in stripped binaries but this has been a fun example. If you are interested in using our API please contact us via the contact page at https://reveng.ai.
No source code was analysed by our AI in the development of this example.
We use essential cookies for security and site functionality, plus optional tracking cookies to understand how you use our site and improve it. Privacy policy.