thixotropist
diff --git a/Readme.md b/Readme.md
Lines changed: 1 addition & 1 deletion
diff --git a/advisor.py b/advisor.py
Lines changed: 1 addition & 1 deletion
diff --git a/analytics.py b/analytics.py
Lines changed: 3 additions & 5 deletions
diff --git a/content/en/docs/Developer_reference/Dependencies.md b/content/en/docs/Developer_reference/Dependencies.md
Lines changed: 1 addition & 1 deletion
diff --git a/content/en/docs/Examples/Whisper_exploration_1.md b/content/en/docs/Examples/Whisper_exploration_1.md
Lines changed: 74 additions & 0 deletions
diff --git a/content/en/docs/Examples/Whisper_memcpy.md b/content/en/docs/Examples/Whisper_memcpy.md
Lines changed: 140 additions & 0 deletions
diff --git a/content/en/docs/Examples/Whisper_output_forensics.md b/content/en/docs/Examples/Whisper_output_forensics.md
Lines changed: 119 additions & 0 deletions
diff --git a/content/en/docs/Examples/_index.md b/content/en/docs/Examples/_index.md
Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # README
 
->Note: The complete documentation for this project will eventually be held as a hugo website.
+>Note: The complete documentation for this project can be found at https://thixotropist.github.io/ghidra_advisor/
 
 What tool might help Ghidra users interpret confusing listing and decompiler views?
 Modern compilers can optimize simple C code into confusing instruction sequences.
 
@@ -98,4 +98,4 @@ def __init__(self, source=None):
     args = parser.parse_args()
     with open(args.filename, 'r', encoding='utf8') as f:
         advisor = Advisor(f.read())
-        print('\n'.join(advisor.report))
+        print('\n'.join(advisor.report))
@@ -50,6 +50,9 @@ def __init__(self, disassembly_fragment: str, mode:DisassemblyMode):
                     label = m.group(1)
                     self.logger.debug("Found an interior label: %s", m.group(1))
                     self.opcode_handler.add_label(label)
+        # finish by adding traits
+        if 'has_loop' in self.opcode_handler.context:
+            self.opcode_handler.significant_opcodes.append('_loop')
 
     def display(self):
         """
@@ -71,10 +74,6 @@ def get_signature(self, name:str):
         """
         return a single signature name and signature
         """
-        if name == 'traits':
-            if 'has_loop' in self.opcode_handler.context:
-                return 'has_loop'
-            return 'has_no_loop'
         if name == 'Opcodes, ordered':
             return ','.join(self.opcode_handler.significant_opcodes)
         if name == 'Opcodes, sorted':
@@ -86,7 +85,6 @@ def get_signature(self, name:str):
     def get_signatures(self):
         "Return the collected signatures as a dict"
         sigs = {}
-        sigs['traits'] = self.get_signature('traits')
         sigs['Opcodes, ordered'] = self.get_signature('Opcodes, ordered')
         sigs['Opcodes, sorted'] = self.get_signature('Opcodes, sorted')
         return sigs
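The net effect of the `analytics.py` hunks above is that the loop trait no longer lives in a separate `traits` signature; a detected loop is instead recorded as the pseudo-opcode `_loop` in the opcode list. A minimal stand-alone sketch of that behavior follows. `OpcodeHandlerSketch` is a hypothetical stand-in that models only the two attributes the diff touches, and the use of `sorted()` for the sorted signature is an assumption, since that line is not shown in the diff.

```python
from dataclasses import dataclass, field


@dataclass
class OpcodeHandlerSketch:
    """Hypothetical stand-in exposing only the attributes used in the diff."""
    context: set = field(default_factory=set)
    significant_opcodes: list = field(default_factory=list)


def finish_traits(handler):
    """Mirror the added lines: record a detected loop as the pseudo-opcode '_loop'."""
    if 'has_loop' in handler.context:
        handler.significant_opcodes.append('_loop')


def get_signatures(handler):
    """Build the two remaining signatures (the sorted form is an assumption)."""
    return {
        'Opcodes, ordered': ','.join(handler.significant_opcodes),
        'Opcodes, sorted': ','.join(sorted(handler.significant_opcodes)),
    }


# Opcodes taken from the memcpy loop discussed in Whisper_memcpy.md.
handler = OpcodeHandlerSketch(
    context={'has_loop'},
    significant_opcodes=['vsetvli', 'vle8.v', 'vse8.v'],
)
finish_traits(handler)
sigs = get_signatures(handler)
print(sigs['Opcodes, ordered'])
print(sigs['Opcodes, sorted'])
```

Because `_` sorts before lowercase letters in ASCII, `_loop` leads the sorted signature, which is consistent with the `_loop,vle8.v,vse8.v,vsetvli` match string shown in the memcpy example.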
@@ -11,7 +11,7 @@ Without this branch Ghidra is stuck with a never-ratified older version of RISCV
 
 ## Bazel
 
-* [Bazel 11.4](https://github.com/bazelbuild/bazel/releases)
+* [Bazel 7.4](https://github.com/bazelbuild/bazel/releases)
 
 Bazel builds in this workspace generate output in the temporary directory /run/user/1000/bazel, as specified in .bazelrc.
 This override can be changed or removed. 
 
@@ -0,0 +1,74 @@
+---
+title: Whisper Exploration Example
+linkTitle: Whisper Exploration
+weight: 20
+---
+
+How do we move forward when the Advisor doesn't provide a lot of help?  We'll start with an example
+taken from Whisper.cpp's main routine.
+
+```cpp
+pFVar26 = local_460;
+vsetivli_e64m1tama(2);
+local_700 = (FILE *)local_c0._40_8_;
+local_6f8 = pFVar34->_IO_read_ptr;
+auVar45 = vle64_v(avStack_470);
+vmv_v_i(in_v4,0);
+auVar46 = vle64_v(&local_700);
+vse64_v(in_v4,avStack_470);
+auVar47 = vslidedown_vi(auVar45,1);
+auVar46 = vslidedown_vi(auVar46,1);
+local_4e0 = (FILE *)vmv_x_s(auVar46);
+auVar46 = vle64_v(&local_700);
+pcVar20 = (char *)vmv_x_s(auVar47);
+local_4d8 = local_88;
+local_c0._40_8_ = vmv_x_s(auVar45);
+pFVar34->_IO_read_ptr = pcVar20;
+local_4e8 = (FILE *)vmv_x_s(auVar46);
+local_460 = (FILE *)0x0;
+local_88 = pFVar26;
+std::vector<>::~vector((vector<> *)&local_4e8);
+std::vector<>::~vector(avStack_470);
+std::_Rb_tree<>::_M_erase((_Rb_tree_node *)local_490);
+```
+
+This clearly isn't a loop.  Instead it looks like an initialization sequence in which
+vector instructions stand in for a few scalar loads and stores.  The Advisor results aren't very helpful:
+
+```text
+Signatures:
+
+    Vector length set to = 0x2
+    Element width is = 64 bits
+    Vector load: vle64.v
+    Vector load: vle64.v
+    Vector store: vse64.v
+    Vector integer slidedown: vslidedown.vi
+    Vector integer slidedown: vslidedown.vi
+    Vector load: vle64.v
+    Significant operations, in the order they appear:
+        vsetivli,vle64.v,vmv.v.i,vle64.v,vse64.v,vslidedown.vi,vslidedown.vi,vmv.x.s,vle64.v,vmv.x.s,vmv.x.s,vmv.x.s
+    Significant operations, in alphanumeric order:
+        vle64.v,vle64.v,vle64.v,vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vmv.x.s,vse64.v,vsetivli,vslidedown.vi,vslidedown.vi
+
+Similarity Analysis
+
+Compare the clipped example to the database of vectorized examples.
+
+The best match is id=1889 [0.652]= vmv.v.i,vmv.x.s,vmv.x.s,vmv.x.s,vse8.v,vsetivli,vsetivli,vsetivli,vsetvli
+
+The clip is similar to the reference example data/custom_testsuite/builtins/string_rv64gcv:bzero_15
+
+```
+
+This suggests several Advisor improvements:
+
+* explicitly report that no loops were found, and that the stanza is likely a vector optimization of
+  scalar instructions.
+* add a quick explanation of what `vslidedown.vi` does.
+* annotate the `vmv` instructions, especially any that load constants into registers.
+
+A manual analysis suggests that the vector instructions manipulate pairs of 64-bit pointers,
+variously copying them, zeroing them, or copying the first or second element of a pair into
+a scalar register.  That probably means we want simple C++ vector manipulation functions in our
+set of custom patterns.
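The pair manipulation described in the manual analysis can be modeled in a few lines of Python. This is only a sketch under stated assumptions: memory is a list of 64-bit words, the vector length is fixed at 2 as set by `vsetivli`, and positions that `vslidedown.vi` shifts in from beyond the vector are filled with 0 here (the stanza only ever reads element 0 afterwards). The helper names mirror the decompiler's intrinsic names but are otherwise hypothetical.

```python
def vle64_v(memory, addr, vl=2):
    """Model vle64.v: load vl consecutive 64-bit elements starting at addr."""
    return list(memory[addr:addr + vl])


def vse64_v(memory, addr, vec):
    """Model vse64.v: store every element of vec starting at addr."""
    memory[addr:addr + len(vec)] = vec


def vmv_v_i(imm, vl=2):
    """Model vmv.v.i: splat an immediate into every element."""
    return [imm] * vl


def vslidedown_vi(vec, offset):
    """Model vslidedown.vi: element i takes the value of element i + offset."""
    return vec[offset:] + [0] * min(offset, len(vec))


def vmv_x_s(vec):
    """Model vmv.x.s: copy element 0 into a scalar."""
    return vec[0]


# A pair of 64-bit "pointers" stored at word index 0 (values are made up).
memory = [0x1000, 0x2000, 0xDEAD]
pair = vle64_v(memory, 0)
first = vmv_x_s(pair)                      # first element of the pair
second = vmv_x_s(vslidedown_vi(pair, 1))   # second element, slid into slot 0
vse64_v(memory, 0, vmv_v_i(0))             # zero the original pair
print(hex(first), hex(second), memory)
```

Read this way, `vslidedown.vi` followed by `vmv.x.s` is simply "extract element 1", which may be the annotation the Advisor report is missing.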
@@ -0,0 +1,140 @@
+---
+title: Whisper Memcpy
+linkTitle: Whisper Memcpy
+weight: 50
+---
+
+Is it easy to recognize vector expansions of libc functions like `memcpy`?
+
+Let's locate some explicit invocations of `memcpy` within Whisper and
+see what the Advisor has to say.
+
+```c++
+struct whisper_context * whisper_init_from_buffer_with_params_no_state(void * buffer, size_t buffer_size, struct whisper_context_params params) {
+    struct buf_context {
+        uint8_t* buffer;
+        size_t size;
+        size_t current_offset;
+    };
+    loader.read = [](void * ctx, void * output, size_t read_size) {
+        buf_context * buf = reinterpret_cast<buf_context *>(ctx);
+
+        size_t size_to_copy = buf->current_offset + read_size < buf->size ? read_size : buf->size - buf->current_offset;
+
+        memcpy(output, buf->buffer + buf->current_offset, size_to_copy);
+        buf->current_offset += size_to_copy;
+
+        return size_to_copy;
+    };
+};
+```
+
+This source example shows a few traits:
+
+* the number of bytes to copy is not in general known at compile time
+* the buffer type is `uint8_t*`
+* there are no alignment guarantees
+
+GCC 15 compiles the lambda stored in `loader.read` as
+`whisper_init_from_buffer_with_params_no_state::{lambda(void*,void*,unsigned_long)#1}::_FUN`.
+The relevant instruction sequence (trimmed of address and whitespace) is:
+
+```as
+LAB_000b0be2
+    vsetvli  a3,param_3,e8,m8,ta,ma  
+    vle8.v   v8,(a4)
+    c.sub    param_3,a3
+    c.add    a4,a3
+    vse8.v   v8,(param_2)
+    c.add    param_2,a3
+    c.bnez   param_3,LAB_000b0be2
+```
+
+Copying the Ghidra listing to the clipboard and running the Advisor gives us:
+
+```text
+Clipboard Contents to Analyze
+
+LAB_000b0be2                                    XREF[1]:     000b0bf4(j)
+000b0be2 d7 76 36 0c     vsetvli                        a3,param_3,e8,m8,ta,ma
+000b0be6 07 04 07 02     vle8.v                         v8,(a4)
+000b0bea 15 8e           c.sub                          param_3,a3
+000b0bec 36 97           c.add                          a4,a3
+000b0bee 27 84 05 02     vse8.v                         v8,(param_2)
+000b0bf2 b6 95           c.add                          param_2,a3
+000b0bf4 7d f6           c.bnez                         param_3,LAB_000b0be2
+
+Signatures:
+
+    Element width is = 8 bits
+    Vector registers are grouped with MUL = 8
+    Vector load: vle8.v
+        Vector load is to multiple registers
+    Vector store: vse8.v
+        Vector store is from multiple registers
+    At least one loop exists
+    Significant operations, in the order they appear:
+        vsetvli,vle8.v,vse8.v,_loop
+    Significant operations, in alphanumeric order:
+        _loop,vle8.v,vse8.v,vsetvli
+
+Similarity Analysis
+
+Compare the clipped example to the database of vectorized examples.
+
+The best match is id=1873 [1.000]= _loop,vle8.v,vse8.v,vsetvli
+
+The clip is similar to the reference example data/custom_testsuite/builtins/memcpy_rv64gcv:memcpy_255
+Reference C Source
+
+void memcpy_255()
+{
+  __builtin_memcpy (to, from, 255);
+};
+
+Reference Compiled Assembly Code
+
+65e	auipc	a3,0x2
+662	ld	a3,-1678(a3)
+666	auipc	a2,0x0
+66a	addi	a2,a2,82
+66e	li	a4,255
+672	vsetvli	a5,a4,e8,m8,ta,ma
+676	vle8.v	v8,(a2)
+67a	sub	a4,a4,a5
+67c	add	a2,a2,a5
+67e	vse8.v	v8,(a3)
+682	add	a3,a3,a5
+```
+
+The Advisor has matched the vector instruction loop to the GCC `__builtin_memcpy` test case where the
+number of bytes to transfer is large (255).  The surrounding scalar instructions are not identical, but the vector opcode signature matches exactly (score 1.000).
+
+This example shows something important that we probably want to add to the Advisor's report:
+
+The `vsetvli` instruction includes the `m8` multiplier option, which means vector operations cover groups of 8
+registers.  The `vle8.v` and `vse8.v` instructions name only vector register `v8`, but each load and store affects the 8
+registers `v8` through `v15`.  If the `__builtin_memcpy` appeared in an inline code fragment, where
+there may be more pressure on vector register availability, we might have seen very similar code
+with multipliers of `m4`, `m2`, or `m1`.
+
+What does the Ghidra decompiler show for this instruction sequence?
+
+```c
+  do {
+    lVar3 = vsetvli_e8m8tama(uVar1);
+    auVar4 = vle8_v(lVar2);
+    uVar1 = uVar1 - lVar3;
+    lVar2 = lVar2 + lVar3;
+    vse8_v(auVar4,param_2);
+    param_2 = (void *)((long)param_2 + lVar3);
+  } while (uVar1 != 0);
+```
+
+What would we like Ghidra's decompiler to show instead?  Something like:
+
+```c
+__builtin_memcpy(param_2, lVar2, uVar1);
+```
+
+That's not quite correct, as `__builtin_memcpy` doesn't modify `param_2` or `lVar2`, while the loop advances both pointers and counts `uVar1` down to zero.
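The strip-mined loop in the Ghidra listing can be modeled in Python to show how `vsetvli` and the `m8` multiplier determine the stride of each pass. This is a sketch with assumptions: `vlen_bits=128` is a hypothetical hardware vector length, and `vsetvli` is simplified to `min(avl, VLMAX)`, although the RISC-V vector specification gives implementations some latitude when the requested length falls between VLMAX and twice VLMAX. The function names are illustrative, not part of the Advisor.

```python
def vsetvli(avl, sew_bits=8, lmul=8, vlen_bits=128):
    """Simplified vsetvli: grant min(requested length, VLMAX) elements."""
    vlmax = lmul * vlen_bits // sew_bits
    return min(avl, vlmax)


def vector_memcpy(dst, src, n, lmul=8, vlen_bits=128):
    """Mirror the do/while loop from the listing; return the stride of each pass."""
    dst_off = 0
    src_off = 0
    strides = []
    while True:
        vl = vsetvli(n, lmul=lmul, vlen_bits=vlen_bits)  # vsetvli a3,param_3,e8,m8
        chunk = src[src_off:src_off + vl]                # vle8.v v8,(a4)
        n -= vl                                          # c.sub param_3,a3
        src_off += vl                                    # c.add a4,a3
        dst[dst_off:dst_off + vl] = chunk                # vse8.v v8,(param_2)
        dst_off += vl                                    # c.add param_2,a3
        strides.append(vl)
        if n == 0:                                       # c.bnez param_3,LAB
            break
    return strides


src = bytes(range(255))
dst_m8 = bytearray(255)
strides_m8 = vector_memcpy(dst_m8, src, 255)           # m8: groups of 8 registers
dst_m1 = bytearray(255)
strides_m1 = vector_memcpy(dst_m1, src, 255, lmul=1)   # m1: a single register
print(strides_m8)
print(len(strides_m1))
```

Under these assumptions the `m8` variant finishes the 255-byte copy in two passes, while an `m1` variant of the same loop needs sixteen. The opcode sequence is identical in both cases, which suggests the register multiplier is worth reporting separately from the opcode signature.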
@@ -0,0 +1,119 @@
+---
+title: Whisper Output Forensics
+linkTitle: Whisper Output Forensics
+weight: 101
+---
+
+You might expect Whisper to use a lot of vector instructions in its inference engine, and it definitely does.
+Are vector instructions also common enough to complicate forensic analysis of the functions an
+adversary is most likely to target?  For this example we take a deep dive into the function `output_txt`,
+since malicious code might want to read and alter dictated text.
+
+This code also lets us examine how RISCV vector instructions are used to implement the `libstdc++` vector library functions.
+
+```c++
+const char * whisper_full_get_segment_text(struct whisper_context * ctx, int i_segment) {
+    return ctx->state->result_all[i_segment].text.c_str();
+}
+
+static bool output_txt(struct whisper_context * ctx, const char * fname, const whisper_params & params, std::vector<std::vector<float>> pcmf32s) {
+    std::ofstream fout(fname);
+    if (!fout.is_open()) {
+        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname);
+        return false;
+    }
+    fprintf(stderr, "%s: saving output to '%s'\n", __func__, fname);
+    const int n_segments = whisper_full_n_segments(ctx);
+    for (int i = 0; i < n_segments; ++i) {
+        const char * text = whisper_full_get_segment_text(ctx, i);
+        std::string speaker = "";
+        if (params.diarize && pcmf32s.size() == 2)
+        {
+            const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
+            const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
+            speaker = estimate_diarization_speaker(pcmf32s, t0, t1);
+        }
+        fout << speaker << text << "\n";
+    }
+    return true;
+}
+```
+
+The key elements of this function are:
+
+* text is collected in segments and stored in the context variable `ctx`
+* text segments are retrieved with the function `whisper_full_get_segment_text`
+* text is copied into an output stream `fout`
+
+The `params.diarize` code block matters only if voice is collected
+in stereo and Whisper has been asked to differentiate between speakers.
+
+The Ghidra decompiler shows four vector instruction stanzas, each starting with a `vset*` instruction.
+The first of these is a simple initialization:
+
+```c
+vsetivli_e64m1tama(2);
+vmv_v_i(in_v1,0);
+vse64_v(in_v1,auStack_90);
+vse64_v(in_v1,auStack_80);
+```
+
+These instructions zero two adjacent 16-byte blocks of memory.  These are likely
+four 64-bit pointers or counters embedded within structures.
+
+The next vector stanza is:
+
+```c
+vsetivli_e64m1tama(2);
+lStack_2e0 = lStack_288;
+uStack_2d8 = local_280[0];
+auVar24 = vle64_v(&lStack_2e0);
+auVar25 = vle64_v(&lStack_2e0);
+auVar24 = vslidedown_vi(auVar24,1);
+lStack_2a8 = vmv_x_s(auVar25);
+local_2a0[0] = vmv_x_s(auVar24);
+```
+
+This one is puzzling, as it appears to load the same pair of 64-bit values twice, then copy the first and second elements out separately with `vmv_x_s`.
+
+The next stanza looks like a simple `memcpy` expansion:
+
+```c
+do {
+    lVar18 = vsetvli_e8m8tama(lStack_288);
+    auVar24 = vle8_v(puVar16);
+    lStack_288 = lStack_288 - lVar18;
+    puVar16 = (ulong *)((long)puVar16 + lVar18);
+    vse8_v(auVar24,puVar20);
+    puVar20 = (ulong *)((long)puVar20 + lVar18);
+} while (lStack_288 != 0);
+```
+
+The final stanza is the interesting one:
+
+```c
+pcVar17 = text;
+do {
+    vsetvli_e8m1tama(0);
+    pcVar17 = pcVar17 + lVar18;
+    auVar24 = vle8ff_v(pcVar17);
+    auVar24 = vmseq_vi(auVar24,0);
+    lVar19 = vfirst_m(auVar24);
+    lVar18 = in_vl;
+} while (lVar19 < 0);
+std::__ostream_insert<>(pbVar12,text,(long)(pcVar17 + (lVar19 - (long)text)));
+```
+
+This appears to be a vector implementation of `strlen(text)`, whose result is passed to `std::__ostream_insert<>` as the length argument.
+
+Our hypothetical adversary would want to evaluate `*text` and reset the `text` pointer to the maliciously altered output string.
+
+The current Advisor classifies these four stanzas as:
+
+* some sort of initializer
+* some sort of shuffle
+* `memcpy`
+* `strlen`
+
+A Ghidra user would likely ignore the initializer and the shuffle as doing something benign and obscure within the I/O subsystem,
+recognize the `memcpy` and `strlen` for what they are, then concentrate on any unexpected manipulations of the `*text` string.
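The `strlen` stanza relies on the fault-only-first load `vle8ff.v`, which may return fewer elements than requested rather than fault partway through, combined with `vmseq.vi` and `vfirst.m` to locate the terminating NUL. The Python sketch below models that loop. It assumes a hypothetical `vlmax` of 16 bytes and represents memory as a `bytes` object whose end stands in for an unmapped page; the helper names mirror the decompiler's intrinsics but are otherwise illustrative.

```python
def vle8ff_v(memory, addr, vlmax):
    """Fault-only-first load: fault only if element 0 is unreadable, else truncate."""
    if addr >= len(memory):
        raise MemoryError("fault on first element")
    return memory[addr:addr + vlmax]


def vmseq_vi(vec, imm):
    """Model vmseq.vi: build a mask of the elements equal to the immediate."""
    return [element == imm for element in vec]


def vfirst_m(mask):
    """Model vfirst.m: index of the first set mask bit, or -1 if none is set."""
    for index, bit in enumerate(mask):
        if bit:
            return index
    return -1


def vector_strlen(memory, text, vlmax=16):
    """Mirror the decompiled do/while loop; return the length of the string at text."""
    ptr = text
    vl = 0
    while True:
        ptr += vl                             # pcVar17 = pcVar17 + lVar18
        chunk = vle8ff_v(memory, ptr, vlmax)  # vle8ff_v(pcVar17)
        first = vfirst_m(vmseq_vi(chunk, 0))  # vmseq_vi, then vfirst_m
        vl = len(chunk)                       # lVar18 = in_vl (possibly trimmed)
        if first >= 0:                        # } while (lVar19 < 0)
            break
    return ptr + first - text


message = b"hello, vectorized world"
memory = b"xx" + message + b"\0"
length = vector_strlen(memory, 2)
print(length == len(message))
```

The second pass in this example reads a truncated chunk because the string ends close to the end of the modeled memory, which is the situation a fault-only-first load exists to handle.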
@@ -1,6 +1,6 @@
 ---
 title: Ghidra Advisor Examples
-linkTitle: Documents
+linkTitle: Examples
 weight: 10
 ---