
Conversation

@pranavk pranavk commented Apr 24, 2025

This is similar to SHF_X86_64_LARGE and allows custom section names to be marked as LARGE and hence moved away to outer edges of the binary to reduce relocation pressure.

@smithp35 smithp35 (Contributor) commented Apr 25, 2025

Can you give some more details about how this will be used? Thanks to SHF_EXCLUDE being erroneously defined in the SHF_MASKPROC range, we've only got 2 processor-specific flags available in that space, and this would take one of them. We've got to be careful that this is the best use of the flag.

If this is going to affect multiple processor architectures, and not just AArch64 and x86_64, would a processor-neutral approach be more appropriate, since each processor only has 3 flags available in practice? Ideally SHF_EXCLUDE would be deprecated and recoded, but I can't see that happening with the generic ELF spec in limbo.

For known sections that should be moved away at link time, a naming convention could be used. You mention custom sections; is this just so that sections with custom names that do not follow the naming convention can still be identified as large?

@MaskRay MaskRay (Contributor) commented Apr 25, 2025

(Some notes on https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models)

During my time at Google, we encountered relocation overflow issues with large x86-64 executables built with specific instrumentations like -fprofile-generate and various -fsanitize=, at optimization levels -O1 and even -O3. (Unoptimized -O0 builds with these instrumentations would have exacerbated the problem.)

Within the large Bazel monorepo, we aimed to implement toolchain settings to mitigate this relocation overflow pressure. I believe that ld.lld --default-script could offer an elegant solution by allowing us to mark specific data sections from instrumentation passes.

Unfortunately, I didn't have the opportunity to deploy this during my tenure at Google. Instead, our approach was to utilize setGlobalVariableLargeSection to set the SHF_X86_64_LARGE flag on certain sections. LLD recognizes this flag and adjusts section placement accordingly.

This section-flag-oriented choice was made primarily because the flag had been available for over a decade and some folks disliked a default linker script.

If I were designing SHF_X86_64_LARGE, I would have been cautious about allocating a bit from the SHF_MASKPROC range (only 4, or 3 if we exclude SHF_EXCLUDE).

Before allocating another bit from SHF_MASKPROC for a different architecture, I would prioritize adding --default-script to the toolchain settings.
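As an illustration of what such a default script might contain (the output section names and placement here are my own guesses, not an actual deployed configuration; `__llvm_prf_data`/`__llvm_prf_cnts` are the LLVM PGO instrumentation sections):

```
/* Hypothetical --default-script fragment: gather instrumentation data
   into output sections placed after .bss, at the outer edge of the
   binary, so ordinary data stays within short relocation range. */
SECTIONS {
  .lprof_data : { *(__llvm_prf_data) }
  .lprof_cnts : { *(__llvm_prf_cnts) }
} INSERT AFTER .bss;
```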

@pranavk pranavk (Author) commented Apr 25, 2025

For known sections to be moved away at link time, a naming convention could be used. You mention custom sections, is this just to distinguish sections with custom names that do not follow the naming convention to be identified as large?

Yes, for example some Nvidia sections such as nv_fatbin, which ideally can be pushed to the outer edges of the binary. We leverage the SHF_X86_64_LARGE flag right now for such sections, so the linker automatically does that and reduces relocation pressure on x86, but that is not possible on Arm.
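For context, the flag can be set directly in assembly today: GNU as and LLVM's integrated assembler accept an "l" flag character in the .section directive on x86-64, which sets SHF_X86_64_LARGE (the "a" allocatable flag shown here is illustrative; the actual flags clang emits for fatbin sections may differ):

```asm
        # x86-64 only: the "l" flag marks the section SHF_X86_64_LARGE,
        # so the linker may place it far from code.
        .section .nv_fatbin, "al", @progbits
```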

As Fangrui mentioned, while a default linker script can do the job for these custom sections, marking a section as large and letting the linker push it to the outer edges of the binary to reduce relocation pressure, as it does on x86, is a more elegant and scalable solution, IMHO. nv_fatbin is just one example; there are other custom section names we would like to handle this way. I wasn't aware that processor-specific section flag bits were already scarce, so I understand the concerns.

In that case I like the idea of making such a flag processor-neutral, although not all targets may need it.

@Wilco1 Wilco1 (Contributor) commented Apr 25, 2025

Would it be feasible to use, say, .lbss.nv_fatbin and have the linker recognise the section as large based on the prefix? This is what I originally intended for the medium code model for AArch64. Some targets have .sdata for small data; if that might be useful for our code models, we may want to support it too. The other alternative is to add all of this to generic ELF so that we don't reinvent the wheel on several targets.
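The prefix rule could be as simple as the following sketch (the prefix list is hypothetical; a real linker would derive it from the target's code-model conventions):

```python
# Sketch: classify input sections as "large" by name prefix, the way a
# linker might under a .lbss.* / .ldata.* / .lrodata.* convention.
LARGE_PREFIXES = (".lbss.", ".ldata.", ".lrodata.")
LARGE_EXACT = (".lbss", ".ldata", ".lrodata")

def is_large_section(name: str) -> bool:
    # Exact matches for the bare large-section names also count.
    if name in LARGE_EXACT:
        return True
    return name.startswith(LARGE_PREFIXES)

# A renamed fatbin section is recognised without any section flag,
# while the original name is not.
print(is_large_section(".lbss.nv_fatbin"))  # True
print(is_large_section(".nv_fatbin"))       # False
```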

@pranavk pranavk (Author) commented Apr 25, 2025

Would it be feasible to use say .lbss.nv_fatbin

I am not sure the runtime would be okay with these sections being renamed.

@smithp35 smithp35 (Contributor) commented:

The two cases I'm most familiar with where the precise section name matters are:

  • When the program uses linker-defined __start_ and __stop_ symbols to get the base/limit of the section. For instrumentation, this would mean that any runtime library would need to be updated to match the code generator's choice of name.
  • When tools like objcopy are used to extract sections from the binary, which, looking at clang's comment, is what is happening here [1].

As I understand it, marking a section with an SHF_*_LARGE flag is not really a property of the section itself, but more an assertion/promise that all the sections that reference it have been compiled with a suitably long-ranged code model.

I can see that adding a flag makes it easier for linkers to do the right thing (without needing a linker script) for sections that cannot conform to the required naming conventions. However, I'm not yet convinced it is worth burning 50% of our remaining processor-specific section flag space on it.

Given that this affects any architecture that has needed to define code models (x86_64, AArch64, RISCV64, etc.), I think it would be best to try to find an architecture-neutral identifier. If we can't use a section flag in the generic or OS space, there may be other, less elegant but acceptable, alternatives for marking a section as large.

[1]

    else
      FatbinConstantName =
          CGM.getTriple().isMacOSX() ? "__NV_CUDA,__nv_fatbin" : ".nv_fatbin";
    // NVIDIA's cuobjdump looks for fatbins in this section.

@Wilco1 Wilco1 (Contributor) commented Apr 28, 2025

As I understand it, marking a section SHF_*_LARGE flag on a section is not really a property of the section itself, but more an assertion/promise that all the sections that reference it, have been compiled with a suitably long-ranged code-model.

Since the section would be extern, one can always refer to it using PIC/PIE addressing without needing a large code model (which currently does not support PIC/PIE).

In principle the linker could automatically detect huge sections and sort them differently. This would avoid the majority of scenarios where people run into relocation range issues due to a huge section or array.

@smithp35 smithp35 (Contributor) commented:

In principle the linker could automatically detect huge sections and sort them differently. This would avoid the majority of scenarios where people run into relocation range issues due to a huge section or array.

I think that could work for AArch64, although it would likely end up scanning relocations and marking data sections that are "not large" (those containing a non-GOT-generating ADRP/ADR/LDR relocation to a symbol defined by the section). What remains are the sections that can be moved to the end of the program. I'm not sure how easily that would generalise to other architectures, though.
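A toy model of that scan might look like this (the `Reloc` shape and the classification of "short range" are invented for illustration; a real linker would inspect actual relocation types per architecture):

```python
# Toy model: a data section is "movable to the end" unless some code
# references a symbol it defines via a short-range (non-GOT) relocation,
# e.g. a direct ADRP/ADR/LDR sequence on AArch64.
from dataclasses import dataclass

@dataclass
class Reloc:
    target_section: str   # section defining the referenced symbol
    short_range: bool     # True for non-GOT-generating relocations

def movable_sections(data_sections, relocs):
    # Sections pinned near code by at least one short-range reference.
    pinned = {r.target_section for r in relocs if r.short_range}
    return [s for s in data_sections if s not in pinned]

sections = [".data", ".nv_fatbin", ".bss"]
relocs = [Reloc(".data", True), Reloc(".nv_fatbin", False)]
print(movable_sections(sections, relocs))  # ['.nv_fatbin', '.bss']
```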

@Wilco1 Wilco1 (Contributor) commented Apr 28, 2025

I was thinking "any section > 1GB goes at the end" by default, but checking relocations would be even better. Then you could stop worrying about code models and just emit GOT relocations and let the linker deal with it (including optimizing them back to ADRP if in range).

@MaskRay MaskRay (Contributor) commented Apr 28, 2025

(I am still travelling with limited computer access)

I understand and agree that the pushback to a section flag is reasonable.
I was also nervous when the relevant LLVM instrumentation patches landed.

Building on my previous comment,

I believe that ld.lld --default-script could offer an elegant solution by allowing us to mark specific data sections from instrumentation passes.

I strongly recommend that Google test this before adding an ELF marker mechanism to the LLVM assembler and linker.
Additionally, the --default-script option could be leveraged to customize alignment in relation to hugepages.

@pranavk pranavk (Author) commented Apr 29, 2025

Thanks for the discussion. I agree, and am convinced that it's not ideal to burn processor-specific flags on this.

That leaves us with a few options:

  1. Have a processor neutral flag for this.
  2. Have some logic in linker to identify large sections and reorder appropriately.
  3. Use linker scripts.

I am somewhat inclined towards (1).

We have looked into (3), and I think it has limitations unless linker script support is improved. For example, with INSERT A BEFORE B, what if B doesn't exist in some binary? The linker would error out in that case.
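For concreteness, the failure mode looks like this (section names are placeholders): the fragment below links fine when the output contains a .special section, but when no binary in the build defines one, the linker rejects the script rather than skipping the INSERT.

```
/* Placeholder names: if no output section named .special exists in a
   given binary, the link fails instead of silently ignoring the
   INSERT clause. */
SECTIONS {
  .nv_fatbin : { *(.nv_fatbin) }
} INSERT BEFORE .special;
```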

@rnk rnk commented Apr 29, 2025

I believe that ld.lld --default-script could offer an elegant solution by allowing us to mark specific data sections from instrumentation passes.
Additionally, the --default-script option could be leveraged to customize alignment in relation to hugepages.

Google actually has this default linker script now, precisely for the purpose of aligning program segments to huge-page boundaries, but I would say its usability so far has been pretty poor. I should probably read more about linker script features and syntax, but I think the whole "insert after/before" model introduces a lot of program-global dependencies on the existence or non-existence of sections like .interp, .eh_frame, etc., and we have many cases where these sections may or may not be present depending on the input sections. I think we'd be better served by associating well-known section names (nv_fatbin, interp, etc.) with order numbers, and letting the linker sort input sections by those numbers.
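The order-number idea might look roughly like the sketch below (the table, values, and default are invented for illustration; no linker implements this today):

```python
# Sketch of the suggestion: map well-known section names to order
# numbers and sort input sections by them. Unknown names get a
# middle-of-the-road default so they stay near ordinary data, and
# large blobs get high numbers so they land at the outer edge.
SECTION_ORDER = {
    ".interp": 0,       # must come early in the image
    ".text": 100,
    ".data": 200,
    ".nv_fatbin": 900,  # large blob: push to the end
}
DEFAULT_ORDER = 500

def layout(sections):
    return sorted(sections, key=lambda s: SECTION_ORDER.get(s, DEFAULT_ORDER))

print(layout([".nv_fatbin", ".custom", ".text", ".interp"]))
# ['.interp', '.text', '.custom', '.nv_fatbin']
```

Unlike INSERT BEFORE/AFTER, a missing section here simply contributes nothing to the sort, so there is no global dependency on any particular section existing.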

The .init_array priority numbering scheme comes to mind for me as perhaps possible prior art for controlling section ordering, but it's not great.

Mainly, though, I take the point that processor-specific flags are scarce, and we can drop the case for arm flags.

I do think it is worth going down the path of an ISA/OS-neutral section flag, since, after quickly glancing at LLVM's ELF.h, those flags appear less scarce, and the flag encodes a promise that generated code will use general access patterns, i.e. a GOT load in the general case, or whatever pattern is used for a cross-DSO reference.
