A repository to work on a part of the Jak X decompilation, with the purpose of porting the game to PC. This port requires the decompilation of the C++ kernel and the GOAL code that is run by that kernel.
This project's primary purpose is to provide the game/jakx and common/jakx C++ (kernel) code for the OpenGOAL project, which is necessary to run Jak X. As the game also uses networking, the secondary goal is to reverse the labels of the remaining SCE-RT functions (Medius, etc.) so that the game can hopefully be connected to unofficial fan-hosted servers. Even if we step away from the Medius code on the client side of the game, it is still useful to know the names of the network-related C functions that are called from GOAL code. This way, people decompiling that GOAL code can better guess or understand what it is doing.
As said, this project focuses on the C++ part of the code base, with the intention of merging it back into the OpenGOAL project later. As for the part of the code base that consists of GOAL code, initial work to start its decompilation has already been done in this pull request.
The following code can be found in this project:
- `elf/kernel/jakx` and `elf/kernel/common`: The Jak X C++ code. For now, this is still pseudo-code that has to be converted to valid and buildable C++ code. Next, it will have to be debugged so that it behaves as expected, or similarly to Jak 3's kernel where applicable.
- `elf/cpp-dump`: The pseudo-code generated by Ghidra with all the symbols we have added. Months of work were poured into it, so it is often nothing like a raw export of the original ELF. It is useful as a reference of the binary, in case that is ever necessary and I am unavailable.
Using one of the scripts, you can generate an exhaustive label overview of the ELF. There are 4512 occurrences of the pattern "function" in that dump, which should be the exact number of functions reversed. There might be a handful of functions that have not been disassembled (discovered) yet, but that number should be low. There are 1310 occurrences of the regex pattern "FUN_........", hence a bit under a third of all functions (about 29%) carry no information at all (yet). There are 1659 occurrences of the regex pattern ".*FUN_.........*", so additional information is available on about 350 of them (1659 - 1310 = 349), and 1659 should be the number of functions that could not be matched. They are not necessary to port the game, but naming them might make it a little bit easier.
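The counting described above can be reproduced on any list of function labels. The following is a minimal sketch, not one of the repository's scripts: the function name `count_function_labels` is hypothetical, and it assumes that the eight placeholder characters in `FUN_........` are always lowercase hex digits.

```python
import re

# Assumption: Ghidra's default names look like FUN_ followed by 8 hex digits.
DEFAULT_NAME = re.compile(r"FUN_[0-9a-f]{8}")


def count_function_labels(labels):
    """Split labels into unnamed, partly named and named functions."""
    # Labels that are nothing but the default name: no information at all.
    unnamed = sum(1 for name in labels if DEFAULT_NAME.fullmatch(name))
    # Labels that still contain the default name somewhere: not matched yet.
    containing = sum(1 for name in labels if DEFAULT_NAME.search(name))
    return {
        "total": len(labels),
        "unnamed": unnamed,
        "partly_named": containing - unnamed,
        "named": len(labels) - containing,
    }


sample = ["FUN_00133cc8", "FUN_00133cc8_addPurchase", "sceFsReset", "FUN_002730e4"]
print(count_function_labels(sample))
# {'total': 4, 'unnamed': 2, 'partly_named': 1, 'named': 1}
```

Using `fullmatch` for the first count and `search` for the second keeps the two numbers comparable, so their difference is the number of labels that carry extra information next to the default name.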
In order to make sense of the ELF, I've primarily been adding symbol names, which is what I refer to as "symbol matching". Unlike more efficient approaches, I've been working from the ground up: I invested time in adding as many symbols as possible before trying to export decompiled code.
I matched against several BSim servers that held the definitions of the following games. I looked at games that would share parts of the code base (i.e., Jak and Daxter, obviously, or games using Medius) or that were simply released around 2005, the release year of Jak X. Retro Reversing has listed a few PS2 games with unstripped symbols and PS2 demos with symbols.
- My Street (March 9, 2003)
- demo of Ratchet and Clank: Deadlocked (September 2005)
- demo of Jak 3 (August 24, 2004)
- demo of God of War II (May 5, 2006, or 27 February, 2007)
Next to that, I also got symbols from the PS2SDK project, where I could compare strings at best, or compare enums or other variables at worst. Sometimes I also copied over signatures from there.
My naming scheme changed over time as I noticed I needed to be more precise about where my symbols came from and how well I could trust them. This means that I cannot describe something here that will definitely fit all situations. I usually copied over the names and added a suffix, e.g. foo_G.
- `_G`: Usually, I find these symbol names in other guessed symbols, but I'm not entirely sure about them as they do not match perfectly. Careful though: the names may also be completely made up, so check the reference symbols above. With global variables, this is usually made clear by using `ALL_CAPS_STYLE_G`, but not consistently.
- `_S`: These symbol names are based on a string. This means I'm already confident I'm correct (I used to use `_G` or `_Q` for this as well). Later on, I typically remove these labels either way. Some names cannot be verified, and in those cases I would rarely remove the suffix. (Example: `FUN_00133cc8_addPurchase` calls `addPurchase_S` and also prints "addPurchase error" after checking its result.)
- `_Q`: These symbol names are either guessed by matching functions recursively in BSim search windows, or based on strings. In general, I was moderately confident they were right, but wanted to come across another occurrence I could verify to be absolutely sure. I later started to use `_S` for string sources that would give away a name.
- `_W`: I'm guessing a name wildly, based on some function body or data structure that is related somewhere. (I used to use `_G` for this as well.)
- `_T`: The source is one of the tables; depending on how certain I am, it will be combined with `W`, `G` or nothing.
- `_M`: These symbols are usually structures that were given away by the memory dump and are otherwise hardly visible.
- no suffix: This may mean I'm sure it's correct, or that I named the symbol that way early on when I wasn't careful, in which case it might even be made up or simply a guess.
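To triage symbols by how far they can be trusted, the suffixes can be split off mechanically. The sketch below is only an illustration and not part of the repository: the names `SUFFIX_MEANING` and `split_suffix` are hypothetical, and it assumes that the suffix letters sit at the very end of the name and that at most two are combined (as in `IOP_MODULE_DATA_WS`). Names with a suffix in the middle, such as `sceMc2Init_G_Proxy`, are deliberately not handled.

```python
import re

# Short summaries of the suffix meanings listed above.
SUFFIX_MEANING = {
    "G": "guess based on other guessed symbols",
    "S": "based on a string",
    "Q": "BSim or string match, moderately confident",
    "W": "wild guess",
    "T": "taken from one of the tables",
    "M": "given away by the memory dump",
}

# Assumption: one or two suffix letters after the last underscore.
TRAILING_SUFFIX = re.compile(r"_([GSQWTM]{1,2})$")


def split_suffix(symbol):
    """Return (base_name, [suffix letters]); the list is empty without a suffix."""
    match = TRAILING_SUFFIX.search(symbol)
    if match is None:
        return symbol, []
    return symbol[:match.start()], list(match.group(1))


for name in ["mc_get_secrets_S", "IOP_MODULE_DATA_WS", "sceFsReset"]:
    base, letters = split_suffix(name)
    print(base, [SUFFIX_MEANING[letter] for letter in letters])
```

Because the scheme changed over time, the result is best treated as a hint for what to double-check first rather than as a reliable confidence score.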
The addresses of the binary dump from PCSX2 and the decrypted ELF match exactly. As most of the initial work was still in the memory dump, which was the most relevant for the decompilation of the game/jakx and common/jakx C++ code, I created a few scripts that go through the code and usually ask interactively whether to override a symbol or not.
To execute them, I simply copied the script's code into Ghidrathon (but the Python window should normally work too, with some small changes). Most scripts are horribly coded but work fine, and allow you to quit execution at any time. It is recommended, though, to at least minimally understand what the scripts do; they are small anyway. If you're not familiar with the Ghidra API, some knowledge of it could always be handy. You could ask Perplexity/ChatGPT, as they know the API surprisingly well!
Note however that a range of sce functions have an 8-byte address mismatch with my memory dump, for some reason. That might be a bug in the function label porting script, where I used +/- to offset and navigate the code, so it's possible you won't come across that bug.
**************************************************************
* FUNCTION *
**************************************************************
int __stdcall sceFsReset(void)
int v0_lo:4 <RETURN>
undefined8 Stack[-0x10]:8 local_10 XREF[2]: 001192cc(W),
001192e4(R)
sceFsReset XREF[1]: InitIOP:00269774(c)
001192c0 f0 ff bd 27 addiu sp,sp,-0x10
001192c4 1f 00 02 3c lui v0,0x1f
sceFsReset
001192c8 20 00 04 3c lui a0,0x20
001192cc 00 00 bf ff sd ra,0x0(sp)=>local_10
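In the listing above, the function starts at 001192c0 while the ported `sceFsReset` label sits at 001192c8. One way to find every label affected by such a shift is to compare the label addresses against the function entry points. This is a minimal sketch that works on plain dictionaries rather than on the Ghidra API; the function names and the `InitIOP` address are hypothetical, and only the two `sceFsReset` addresses come from the listing.

```python
from collections import Counter


def offset_histogram(function_entries, ported_labels):
    """Count the deltas (label address minus entry address) per shared name."""
    deltas = Counter()
    for name, entry in function_entries.items():
        if name in ported_labels:
            deltas[ported_labels[name] - entry] += 1
    return deltas


def shifted_labels(function_entries, ported_labels):
    """List the names whose ported label is not on the function entry."""
    return sorted(
        name
        for name, entry in function_entries.items()
        if name in ported_labels and ported_labels[name] != entry
    )


# The InitIOP address is made up for illustration.
entries = {"sceFsReset": 0x001192C0, "InitIOP": 0x00269700}
labels = {"sceFsReset": 0x001192C8, "InitIOP": 0x00269700}

print(offset_histogram(entries, labels))  # Counter({8: 1, 0: 1})
print(shifted_labels(entries, labels))    # ['sceFsReset']
```

If the histogram shows a single non-zero delta for a contiguous range of functions, that would support the idea of a systematic offset bug in the porting script rather than individual mistakes.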
In one of the scripts, you might be getting this error when you try to apply a signature override in Ghidra. (I came across it when doing this manually.) This should occur whenever the function call has a BLUE label "ptr_addr1_addr2". If it's either WHITE or simply "LAB_addr", then it's fine and shouldn't happen.
Error overriding signature: ghidra.util.exception.InvalidInputException: DataTypeSymbol has a reference
---------------------------------------------------
Build Date: 2024-Jun-07 1416 EDT
Ghidra Version: 11.1
Java Home: C:\Program Files\Eclipse Adoptium\jdk-17.0.11.9-hotspot
JVM Version: Eclipse Adoptium 17.0.11
OS: Windows 10 10.0 amd64
Workstation: REUBUS
To prevent the above error from appearing (for that function call signature override), simply remove the ptr label, so that it becomes a blue "LAB_addr" label, and try again. You might then be able to execute the signature override. I wasn't, as I think I ran into a bug that might get fixed in the future. (The code is of course perfectly fine ;p)
From what I understand from Perplexity, string tables are used for symbol resolution and to serve as debugging information. All 10 that I found reside in .text, but sadly, somewhere around function entry, they stopped appearing. I have labeled the start of the string tables with CPP_FILE.
I've used a prompt to apply these names, as Perplexity is very good at transforming these strings into signatures, but the results need to be double-checked. Additionally, the return types appear not to be reliable. Are they even part of those strings? Further down in this conversation, you can find a few examples that are useful to learn how to interpret these strings. The prompt that gave okay results is the following:
I'm trying to figure out the signature of a symbol that I found in a C++ string table for symbol resolution and debugging. What would be the signature of the following mangled name: _videoCallbackEP7sceMpegP16sceMpegCbDataStrPv.
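The parameter part of such a string can also be decoded by hand, which helps when double-checking the generated signatures. The sketch below is a simplified decoder and not a full demangler: it only understands pointers (`P`), length-prefixed names and a few one-letter builtin types, and it assumes that the `E` after `_videoCallback` separates the name from the parameter list. The function name `decode_params` is hypothetical. If these strings follow the Itanium C++ ABI scheme, ordinary (non-template) functions do not encode their return type at all, which would explain why the suggested return types are unreliable.

```python
# One-letter builtin type codes (a small subset).
BUILTIN_TYPES = {
    "v": "void", "c": "char", "s": "short", "i": "int",
    "l": "long", "f": "float", "d": "double", "b": "bool",
}


def decode_params(encoded):
    """Decode a parameter list such as 'P7sceMpegPv' into C type names."""
    params = []
    pos = 0
    while pos < len(encoded):
        # Each leading 'P' adds one level of pointer indirection.
        pointers = 0
        while pos < len(encoded) and encoded[pos] == "P":
            pointers += 1
            pos += 1
        if pos >= len(encoded):
            raise ValueError("pointer marker without a type")
        if encoded[pos].isdigit():
            # Length-prefixed name: decimal length followed by that many characters.
            start = pos
            while pos < len(encoded) and encoded[pos].isdigit():
                pos += 1
            length = int(encoded[start:pos])
            name = encoded[pos:pos + length]
            if len(name) != length:
                raise ValueError("name is shorter than its length prefix")
            pos += length
        elif encoded[pos] in BUILTIN_TYPES:
            name = BUILTIN_TYPES[encoded[pos]]
            pos += 1
        else:
            raise ValueError("unsupported type code: " + encoded[pos])
        params.append(name + "*" * pointers)
    return params


# Assumption: the first 'E' ends the function name.
name, _, encoded = "_videoCallbackEP7sceMpegP16sceMpegCbDataStrPv".partition("E")
print(name + "(" + ", ".join(decode_params(encoded)) + ")")
# _videoCallback(sceMpeg*, sceMpegCbDataStr*, void*)
```

Anything outside this subset (references, const qualifiers, substitutions, templates) raises a `ValueError`, so such names still need a real demangler or a manual check.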
Large tasks:
- Although I don't expect to gain much from it, one can try to match the functions against those of Jak 1 or Jak 2.
- `gcc2_compiled.` functions in other (demo) games have additional labels that give away their names, apparently. I noticed this too late, but it would be helpful to locate other nameless functions if we can match these gcc functions.
- Apply mangled symbol names from tables (under `CPP_FILE`) if reliable.
- Find source of orphaned strings (`001eba50`, `001e78b0`, `001e78d0`)
Less important details to check:
- Why is `IOP_MODULE_DATA_WS` not referenced by `_AddModuleArgs`? It occurs in the function's body.
- Is `DAT_001f63e0` (or lower) an array of thread ids?
- What are these functions for?

      DAT_001f5b78_func1 = 0;
      DAT_001f5b7c_func2 = 0;
      DAT_001f5b80_func3 = 0;

- Compare print functions with new sources. NOTE: the print functions are a mess, don't try to fix their names, as sources will contradict each other. (For example, `fiprintf` calls `_vfiprintf_r` in all binaries, but each does so differently.)
- Iterate over all matches of the regex pattern `0x(1|2)[0-9a-f]{5}` to find addresses that should be labeled instead. Currently, there are 787 in the decrypted ELF.
- `MC_run` is very large in Jak 3, but consists of 4 function calls in Jak X. This might also expose the missing functions `mc_get_filename`, `mc_get_filename_no_dir`, `mc_print`, `mc_get_total_bank_size` and `mc_checksum`, if they exist. (No `bank0,1,2...` found, but a `save1,2,3,4` does exist.)

      void MC_run(void) {
        s32 sema_id = DAT_002d3908_mc_sema_id;
        WaitSema(DAT_002d3908_mc_sema_id);
        FUN_002730e4();  // otherwise only called in FUN_0027387c
        FUN_002732e4();  // otherwise only called in FUN_0027387c
        SignalSema(sema_id);
      }

- Attempt to find `cb_reprobe_format`, `cb_unformat`, `cb_reprobe_createfile`, `cb_reprobe_save`, ... (some of which are listed in kmemcard.h).
- Find the exact start of the array at 00283740 of 0x8c0 sized elements. Also see `sceMc2Init_G_Proxy`, where `uVar6 * 0x230` as well as `uVar6 * 0x8c0` occur.
- Find the exact start of the array over 00283864, 00283874 of 0x230 sized elements (see `mc_get_secrets_S`).
- Find the exact start of the array over 00283860 of 0x230 sized elements (see `MC_shutdown_G`).
- Find usage of `mc_slot_info` and `mc_file_info`.
- ... many more that I forgot to write down.
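The regex search for unlabeled addresses mentioned in the list above can be scripted as follows. This is a minimal sketch over plain text, not a Ghidra script; the function name `find_raw_addresses` and the sample lines are hypothetical (only the two addresses are taken from this document).

```python
import re

# The pattern from the list above: 0x, then 1 or 2, then five hex digits.
RAW_ADDRESS = re.compile(r"0x(1|2)[0-9a-f]{5}")


def find_raw_addresses(source):
    """Return (line_number, literal) pairs for every hard-coded address."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        # finditer + group(0) yields the whole literal; re.findall would
        # return only the captured "1" or "2" because of the group.
        for match in RAW_ADDRESS.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


sample = (
    "iVar1 = *(int *)0x283740;\n"
    "WaitSema(sema_id);\n"
    "puts((char *)0x1eba50);"
)
print(find_raw_addresses(sample))
# [(1, '0x283740'), (3, '0x1eba50')]
```

The pattern is not anchored, so a longer literal that merely starts with a matching prefix is reported as well; adding a word boundary (`\b`) at the end would exclude those, at the cost of changing the count quoted above.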
Exporting C++ code is easy with Ghidra's options, but there is only so much you can clean up in Ghidra --- the code is far from perfect. In this phase, I tried to map the content of each function as much as possible with the existing code base from Jak 3.
The methodology for this was to refactor the functions one by one. If code was moved around or non-trivial changes were made, the work was split up into different commits, so that I could backtrack in case I messed up. I'd recommend splitting in these situations:
- When moving code under a label to each `goto`, separate for each label/block. Bring over all the remainders of the code execution until you hit either the end of a parent loop, a return statement or the end of the function.
- Moving variable declarations.
- Converting/Forming for or while loops.

After this is done:
- Add `Ptr<>` or other OpenGOAL++ types.
- Fix header files where needed.
- Compile and run `jak3` code.
- Compile `jakx` code.
- Debug `jakx` code, comparing with `jak3`'s kernel.
- I dumped the PAL game's EE memory using PCSX2
- I added a ton of debugging symbols from other games (such as R&C for SCE-RT) by comparing the functions
- I added types (structs, enums, etc) where possible.
- At that point, I could have just reimplemented the engine, because it was sufficiently reversed and I could compare it with Jak 3's implementation.
- However, I wanted to squeeze out all the information I could find, so I kept on reversing. After all, I reasoned that for the online component, it was necessary to understand what calls were made, and that it would make reversing the GOAL code easier.
- Then I got fed up after finding out a few times that I had been searching through GOAL code without realizing it, so I decided to look again at the original, encrypted PAL version. There, I recognized a function I had already come across in my memory dump (through other games). That's how I was able to decrypt the game's ELF, with the help of Ziemas. The "decryption" prototype can be found on GitHub.
- Then, in December 2024, I started creating scripts to port the symbols from the memory dump to the decrypted ELF.
- In January, I started exporting the C++ code. This took much longer than expected, as for some functions, some structure types had not been discovered or applied, and others were very complex and/or different from their Jak 3 version. I first modified the functions gradually so that they resemble the current Jak 3 functions as much as possible, whilst retaining their execution paths.
- By mid-February, all the functions had been refactored to resemble the `jak3` version as much as possible, ignoring a few functions that are not necessary or are probably the same as in previous versions. Currently, some rough edges still need to be handled before building is possible.