Commit f8d9271

Merge branch 'main' into publish
2 parents a0448a1 + c6b4a5b commit f8d9271

File tree

13 files changed

+1283
-54
lines changed

Readme.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # README
 
->Note: The complete documentation for this project will eventually be held as a hugo website.
+>Note: The complete documentation for this project can be found at https://thixotropist.github.io/ghidra_advisor/
 
 What tool might help Ghidra users interpret confusing listing and decompiler views?
 Modern compilers can optimize simple C code into confusing instruction sequences.

advisor.py

Lines changed: 1 addition & 1 deletion
@@ -98,4 +98,4 @@ def __init__(self, source=None):
     args = parser.parse_args()
     with open(args.filename, 'r', encoding='utf8') as f:
         advisor = Advisor(f.read())
-    print('\n'.join(advisor.report))
+    print('\n'.join(advisor.report))

analytics.py

Lines changed: 3 additions & 5 deletions
@@ -50,6 +50,9 @@ def __init__(self, disassembly_fragment: str, mode:DisassemblyMode):
                 label = m.group(1)
                 self.logger.debug("Found an interior label: %s", m.group(1))
                 self.opcode_handler.add_label(label)
+        # finish by adding traits
+        if 'has_loop' in self.opcode_handler.context:
+            self.opcode_handler.significant_opcodes.append('_loop')
 
     def display(self):
         """
@@ -71,10 +74,6 @@ def get_signature(self, name:str):
         """
         return a single signature name and signature
         """
-        if name == 'traits':
-            if 'has_loop' in self.opcode_handler.context:
-                return 'has_loop'
-            return 'has_no_loop'
         if name == 'Opcodes, ordered':
             return ','.join(self.opcode_handler.significant_opcodes)
         if name == 'Opcodes, sorted':
@@ -86,7 +85,6 @@ def get_signature(self, name:str):
     def get_signatures(self):
         "Return the collected signatures as a dict"
         sigs = {}
-        sigs['traits'] = self.get_signature('traits')
         sigs['Opcodes, ordered'] = self.get_signature('Opcodes, ordered')
         sigs['Opcodes, sorted'] = self.get_signature('Opcodes, sorted')
         return sigs
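
The effect of this hunk can be sketched in a few lines: the separate `traits` signature disappears, and a detected loop instead shows up as a `_loop` pseudo-opcode inside both opcode signatures. The `OpcodeHandler` stand-in and the `signatures` helper below are hypothetical simplifications, not the project's classes; only the `has_loop` context key, the `significant_opcodes` list, and the signature names are taken from the diff. Using `sorted()` for the sorted signature is an assumption, since that code is outside the hunk.

```python
class OpcodeHandler:
    """Hypothetical stand-in holding only the two fields the diff touches."""

    def __init__(self, opcodes, has_loop):
        self.significant_opcodes = list(opcodes)
        self.context = {'has_loop': True} if has_loop else {}


def signatures(handler):
    """Mirror the new behaviour: a detected loop becomes a trailing '_loop' opcode."""
    if 'has_loop' in handler.context:
        handler.significant_opcodes.append('_loop')
    return {
        'Opcodes, ordered': ','.join(handler.significant_opcodes),
        # Assumption: the sorted signature is a plain lexicographic sort.
        'Opcodes, sorted': ','.join(sorted(handler.significant_opcodes)),
    }


sigs = signatures(OpcodeHandler(['vsetvli', 'vle8.v', 'vse8.v'], has_loop=True))
print(sigs['Opcodes, ordered'])
print(sigs['Opcodes, sorted'])
```

Because `_` sorts before lowercase letters in ASCII, the pseudo-opcode lands at the end of the ordered signature and at the front of the sorted one.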

content/en/docs/Developer_reference/Dependencies.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ Without this branch Ghidra is stuck with a never-ratified older version of RISCV
 
 ## Bazel
 
-* [Bazel 11.4](https://github.com/bazelbuild/bazel/releases)
+* [Bazel 7.4](https://github.com/bazelbuild/bazel/releases)
 
 Bazel builds in this workspace generate output in the temporary directory /run/user/1000/bazel, as specified in .bazelrc.
 This override can be changed or removed.
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: Whisper Exploration Example
linkTitle: Whisper Exploration
weight: 20
---

How do we move forward when the Advisor doesn't provide a lot of help? We'll start with an example
taken from Whisper.cpp's main routine.

```cpp
pFVar26 = local_460;
vsetivli_e64m1tama(2);
local_700 = (FILE *)local_c0._40_8_;
local_6f8 = pFVar34->_IO_read_ptr;
auVar45 = vle64_v(avStack_470);
vmv_v_i(in_v4,0);
auVar46 = vle64_v(&local_700);
vse64_v(in_v4,avStack_470);
auVar47 = vslidedown_vi(auVar45,1);
auVar46 = vslidedown_vi(auVar46,1);
local_4e0 = (FILE *)vmv_x_s(auVar46);
auVar46 = vle64_v(&local_700);
pcVar20 = (char *)vmv_x_s(auVar47);
local_4d8 = local_88;
local_c0._40_8_ = vmv_x_s(auVar45);
pFVar34->_IO_read_ptr = pcVar20;
local_4e8 = (FILE *)vmv_x_s(auVar46);
local_460 = (FILE *)0x0;
local_88 = pFVar26;
std::vector<>::~vector((vector<> *)&local_4e8);
std::vector<>::~vector(avStack_470);
std::_Rb_tree<>::_M_erase((_Rb_tree_node *)local_490);
```

This clearly isn't a loop. Instead it is some sort of initialization sequence that allows
vector instructions to slightly optimize the code. The Advisor results aren't very helpful:

```text
Signatures:

Vector length set to = 0x2
Element width is = 64 bits
Vector load: vle64.v
Vector load: vle64.v
Vector store: vse64.v
Vector integer slidedown: vslidedown.vi
Vector integer slidedown: vslidedown.vi
Vector load: vle64.v
Significant operations, in the order they appear:
vsetivli,vle64.v,vmv.v.i,vle64.v,vse64.v,vslidedown.vi,vslidedown.vi,vmv.x.s,vle64.v,vmv.x.s,vmv.x.s,vmv.x.s
Significant operations, in alphanumeric order:
vle64.v,vle64.v,vle64.v,vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vmv.x.s,vse64.v,vsetivli,vslidedown.vi,vslidedown.vi

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1889 [0.652]= vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vse8.v,vsetivli,vsetivli,vsetivli,vsetvli

The clip is similar to the reference example data/custom_testsuite/builtins/string_rv64gcv:bzero_15

```

This suggests several Advisor improvements:

* explicitly report that no loops are found, and that the stanza is likely a vector optimization of
  scalar instruction transforms.
* add a quick explanation of what `vslidedown.vi` does
* the `vmv` instructions need annotation, especially any that load constants into registers.

A manual analysis suggests that the vector instructions manipulate pairs of 64-bit pointers,
variously copying them, zeroing them, or copying the first or second element of the pair into
scalar registers. That probably means we want simple C++ vector manipulation functions in our
set of custom patterns.
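
The pair manipulation described in the manual analysis can be modelled in plain Python. This is only a sketch of the instruction semantics under stated assumptions: vectors are Python lists, `VL` is fixed at 2 to match `vsetivli ... 2` with 64-bit elements, and VLMAX is assumed to equal `VL` (a 128-bit vector register), so elements slid in from past the end read as zero. The pointer values are made up for illustration.

```python
VL = 2  # two 64-bit elements, as set by vsetivli with an immediate of 2


def vle64_v(memory, index):
    """Load VL consecutive 64-bit elements starting at memory[index]."""
    return list(memory[index:index + VL])


def vmv_v_i(imm):
    """Splat an immediate into every element of the vector."""
    return [imm] * VL


def vslidedown_vi(vec, offset):
    """Result element i is source element i+offset; slots past the end read as 0."""
    return [vec[i + offset] if i + offset < len(vec) else 0 for i in range(VL)]


def vmv_x_s(vec):
    """Copy element 0 of the vector into a scalar."""
    return vec[0]


pair = [0x1000, 0x2000]            # hypothetical pair of pointers in memory
loaded = vle64_v(pair, 0)
first = vmv_x_s(loaded)            # first element of the pair
second = vmv_x_s(vslidedown_vi(loaded, 1))  # second element, slid into slot 0
zeroed = vmv_v_i(0)                # value later stored to zero a pair
print(hex(first), hex(second), zeroed)
```

Seen this way, `vslidedown.vi` followed by `vmv.x.s` is simply "read element 1 of the pair", which is the kind of annotation the improvement list asks for.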
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
---
title: Whisper Memcpy
linkTitle: Whisper Memcpy
weight: 50
---

Is it easy to recognize vector expansions of libc functions like `memcpy`?

Let's locate some explicit invocations of `memcpy` within Whisper and
see what the Advisor has to say.

```c++
struct whisper_context * whisper_init_from_buffer_with_params_no_state(void * buffer, size_t buffer_size, struct whisper_context_params params) {
    struct buf_context {
        uint8_t* buffer;
        size_t size;
        size_t current_offset;
    };
    loader.read = [](void * ctx, void * output, size_t read_size) {
        buf_context * buf = reinterpret_cast<buf_context *>(ctx);

        size_t size_to_copy = buf->current_offset + read_size < buf->size ? read_size : buf->size - buf->current_offset;

        memcpy(output, buf->buffer + buf->current_offset, size_to_copy);
        buf->current_offset += size_to_copy;

        return size_to_copy;
    };
};
```

This source example shows a few traits:

* the number of bytes to copy is not in general known at compile time
* the buffer type is `uint8_t*`
* there are no alignment guarantees

GCC 15 compiles the lambda stored in `loader.read` as
`whisper_init_from_buffer_with_params_no_state::{lambda(void*,void*,unsigned_long)#1}::_FUN`.
The relevant instruction sequence (trimmed of addresses and whitespace) is:

```as
LAB_000b0be2
vsetvli a3,param_3,e8,m8,ta,ma
vle8.v v8,(a4)
c.sub param_3,a3
c.add a4,a3
vse8.v v8,(param_2)
c.add param_2,a3
c.bnez param_3,LAB_000b0be2
```

Copying the Ghidra listing to the clipboard and running the Advisor gives us:

```text
Clipboard Contents to Analyze

LAB_000b0be2 XREF[1]: 000b0bf4(j)
000b0be2 d7 76 36 0c vsetvli a3,param_3,e8,m8,ta,ma
000b0be6 07 04 07 02 vle8.v v8,(a4)
000b0bea 15 8e c.sub param_3,a3
000b0bec 36 97 c.add a4,a3
000b0bee 27 84 05 02 vse8.v v8,(param_2)
000b0bf2 b6 95 c.add param_2,a3
000b0bf4 7d f6 c.bnez param_3,LAB_000b0be2

Signatures:

Element width is = 8 bits
Vector registers are grouped with MUL = 8
Vector load: vle8.v
Vector load is to multiple registers
Vector store: vse8.v
Vector store is from multiple registers
At least one loop exists
Significant operations, in the order they appear:
vsetvli,vle8.v,vse8.v,_loop
Significant operations, in alphanumeric order:
_loop,vle8.v,vse8.v,vsetvli

Similarity Analysis

Compare the clipped example to the database of vectorized examples.

The best match is id=1873 [1.000]= _loop,vle8.v,vse8.v,vsetvli

The clip is similar to the reference example data/custom_testsuite/builtins/memcpy_rv64gcv:memcpy_255
Reference C Source

void memcpy_255()
{
__builtin_memcpy (to, from, 255);
};

Reference Compiled Assembly Code

65e auipc a3,0x2
662 ld a3,-1678(a3)
666 auipc a2,0x0
66a addi a2,a2,82
66e li a4,255
672 vsetvli a5,a4,e8,m8,ta,ma
676 vle8.v v8,(a2)
67a sub a4,a4,a5
67c add a2,a2,a5
67e vse8.v v8,(a3)
682 add a3,a3,a5
```

The Advisor has matched the vector instruction loop to the GCC `__builtin_memcpy` test case where the
number of bytes to transfer is large (255). The surrounding scalar instructions are not the same
(compressed `c.sub` and `c.add` here versus `sub` and `add` in the reference), but the vector signature is identical.

This example shows something important that we probably want to add to the Advisor's report:

The `vsetvli` instruction includes the `m8` multiplier option, which means vector operations cover groups of 8
registers. The `vle8.v` instruction names only vector register `v8`, but the loads and stores affect the 8
registers `v8` through `v15`. If the `__builtin_memcpy` appeared in an inline code fragment, where
there may be more pressure on vector register availability, we might have seen very similar code
with multipliers of `m4`, `m2`, or `m1`.
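
The register grouping can be made concrete with a short sketch. The helper names are hypothetical, only integer multipliers are handled, and `VLEN = 128` is an assumed hardware register width; a real core may use a wider one. The alignment check reflects the requirement that a grouped instruction name a register number that is a multiple of the group size.

```python
def register_group(base, lmul):
    """Registers touched by an instruction that names v<base> under LMUL=lmul."""
    if base % lmul != 0:
        raise ValueError("base register must be aligned to the group size")
    return [f"v{n}" for n in range(base, base + lmul)]


def vlmax(vlen_bits, sew_bits, lmul):
    """Maximum elements handled per instruction: (VLEN / SEW) * LMUL."""
    return vlen_bits // sew_bits * lmul


VLEN = 128  # assumption: 128-bit vector registers
for lmul in (1, 2, 4, 8):
    group = register_group(8, lmul)
    print(f"m{lmul}: {group[0]}..{group[-1]}, up to {vlmax(VLEN, 8, lmul)} bytes per pass")
```

Under these assumptions a smaller multiplier uses fewer registers but needs proportionally more loop iterations to copy the same number of bytes.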

What does the Ghidra decompiler show for this instruction sequence?

```c
do {
    lVar3 = vsetvli_e8m8tama(uVar1);
    auVar4 = vle8_v(lVar2);
    uVar1 = uVar1 - lVar3;
    lVar2 = lVar2 + lVar3;
    vse8_v(auVar4,param_2);
    param_2 = (void *)((long)param_2 + lVar3);
} while (uVar1 != 0);
```

What would we like Ghidra's decompiler to show instead? Something like:

```c
__builtin_memcpy(param_2, lVar2, uVar1);
```

That's not quite equivalent: the loop leaves `param_2` and `lVar2` advanced past the copied bytes and `uVar1` at zero,
whereas `__builtin_memcpy` doesn't mutate its arguments.
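
A small Python model of the decompiled loop shows both points at once: the bytes copied match an ordinary `memcpy`, while the source pointer, destination pointer, and count all end up changed. This is a sketch under assumptions: memory is a `bytearray`, pointers are integer offsets into it, `vsetvli` is simplified to `min(count, vlmax)`, and `vlmax=128` corresponds to `e8,m8` on an assumed 128-bit vector unit.

```python
def vsetvli(avl, vlmax):
    """Simplified vl selection: take as many elements as fit in one pass."""
    return min(avl, vlmax)


def vector_copy(memory, dst, src, count, vlmax=128):
    """Mirror the decompiled do/while loop; return the final (dst, src, count)."""
    while True:
        vl = vsetvli(count, vlmax)
        chunk = memory[src:src + vl]      # vle8.v
        count -= vl
        src += vl
        memory[dst:dst + vl] = chunk      # vse8.v
        dst += vl
        if count == 0:
            break
    return dst, src, count


memory = bytearray(range(256)) + bytearray(300)
end_dst, end_src, remaining = vector_copy(memory, dst=256, src=0, count=255)
print(end_dst, end_src, remaining)
```

A decompiler rewrite to `__builtin_memcpy` would therefore also need to account for the updated pointer and count values if any later code reads them.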
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
---
title: Whisper Output Forensics
linkTitle: Whisper Output Forensics
weight: 101
---

You might expect Whisper to use a lot of vector instructions in its inference engine, and it definitely does.
Are vector instructions common enough to complicate Whisper forensic analysis in the functions an
adversary is most likely to target? For this example we will make a deep dive into the function `output_txt`,
since malicious code might want to review and alter dictated text.

This code also lets us examine how RISCV vector instructions are used to implement the `libstdc++` vector library functions.

```c++
const char * whisper_full_get_segment_text(struct whisper_context * ctx, int i_segment) {
    return ctx->state->result_all[i_segment].text.c_str();
}

static bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
    std::ofstream fout(fname);
    if (!fout.is_open()) {
        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
        return false;
    }
    fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        const char * text = whisper_full_get_segment_text(ctx, i);
        std::string speaker = "";
        if (params.diarize && pcmf32s.size() == 2)
        {
            const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
            const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
            speaker = estimate_diarization_speaker(pcmf32s, t0, t1);
        }
        fout << speaker << text << "\n";
    }
    return true;
}
```

The key elements of this function are:

* text is collected in segments and stored in the context variable `ctx`
* text segments are retrieved with the function `whisper_full_get_segment_text`
* text is copied into an output stream `fout`

The `params.diarize` code block matters only if voice is collected
in stereo and Whisper has been asked to differentiate between speakers.

The Ghidra decompiler shows four vector instruction stanzas, each starting with a `vset*` instruction.
The first of these is a simple initialization:

```c
vsetivli_e64m1tama(2);
vmv_v_i(in_v1,0);
vse64_v(in_v1,auStack_90);
vse64_v(in_v1,auStack_80);
```

These instructions initialize two adjacent 16-byte blocks of memory to zero. These are likely
four 64-bit pointers or counters embedded within structures.

The next vector stanza is:

```c
vsetivli_e64m1tama(2);
lStack_2e0 = lStack_288;
uStack_2d8 = local_280[0];
auVar24 = vle64_v(&lStack_2e0);
auVar25 = vle64_v(&lStack_2e0);
auVar24 = vslidedown_vi(auVar24,1);
lStack_2a8 = vmv_x_s(auVar25);
local_2a0[0] = vmv_x_s(auVar24);
```

This one is puzzling, as it appears to load the same pair of 64-bit values twice, then move the first and second elements into separate scalar registers.

The next stanza looks like a simple `memcpy` expansion:

```c
do {
    lVar18 = vsetvli_e8m8tama(lStack_288);
    auVar24 = vle8_v(puVar16);
    lStack_288 = lStack_288 - lVar18;
    puVar16 = (ulong *)((long)puVar16 + lVar18);
    vse8_v(auVar24,puVar20);
    puVar20 = (ulong *)((long)puVar20 + lVar18);
} while (lStack_288 != 0);
```

The final stanza is the interesting one:

```c
pcVar17 = text;
do {
    vsetvli_e8m1tama(0);
    pcVar17 = pcVar17 + lVar18;
    auVar24 = vle8ff_v(pcVar17);
    auVar24 = vmseq_vi(auVar24,0);
    lVar19 = vfirst_m(auVar24);
    lVar18 = in_vl;
} while (lVar19 < 0);
std::__ostream_insert<>(pbVar12,text,(long)(pcVar17 + (lVar19 - (long)text)));
```

This appears to be a vector implementation of `strlen(text)` requested by `std::__ostream_insert<>`.
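
The `strlen` reading of that final stanza can be checked with a small Python model. This is a sketch under assumptions, not the Advisor's logic: memory is a `bytes` object, `text` is an integer offset into it, `vlmax=16` corresponds to `e8,m1` on an assumed 128-bit vector unit, the initial value of `lVar18` is taken to be zero, and a short slice at the end of the buffer stands in for the way a fault-only-first load can trim `vl`.

```python
def vfirst_m(mask):
    """Index of the first set mask bit, or -1 when no bit is set."""
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1


def vector_strlen(memory, text, vlmax=16):
    """Scan up to vlmax bytes per pass for the terminating NUL byte."""
    cursor = text
    vl = 0                                        # assumed initial lVar18
    while True:
        cursor += vl                              # pcVar17 = pcVar17 + lVar18
        chunk = memory[cursor:cursor + vlmax]     # vle8ff.v (short slice = trimmed vl)
        mask = [byte == 0 for byte in chunk]      # vmseq.vi against 0
        first = vfirst_m(mask)                    # vfirst.m
        vl = len(chunk)                           # lVar18 = in_vl
        if first >= 0:
            return cursor + first - text          # pcVar17 + (lVar19 - text)
        if vl == 0:
            raise ValueError("no NUL terminator found")


memory = b"xx" + b"dictated text that spans more than one chunk\0tail"
print(vector_strlen(memory, 2))
```

The returned value is the length argument passed to `std::__ostream_insert<>` in the decompiled code, which is why the stanza behaves like an inlined `strlen`.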

Our hypothetical adversary would want to evaluate `*text` and reset the `text` pointer to the maliciously altered output string.

The current Advisor classifies these four stanzas as:

* some sort of initializer
* some sort of shuffle
* `memcpy`
* `strlen`

A Ghidra user would likely ignore the initializer and the shuffle as doing something benign and obscure within the I/O subsystem,
recognize the `memcpy` and `strlen` for what they are, then concentrate on any unexpected manipulations of the `*text` string.

content/en/docs/Examples/_index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 title: Ghidra Advisor Examples
-linkTitle: Documents
+linkTitle: Examples
 weight: 10
 ---
 
