
AMDGPU: Handle rewriting non-tied MFMA to AGPR form #153015


Merged: 1 commit merged into main from users/arsenm/amdgpu/handle-untied-agpr-mfma-rewrite-2 on Aug 20, 2025

Conversation

@arsenm (Contributor) commented Aug 11, 2025

If src2 and dst aren't the same register, to fold a copy
to AGPR into the instruction we also need to reassign src2
to an available AGPR. All the other uses of src2 also need
to be compatible with the AGPR replacement in order to avoid
inserting other copies somewhere else.

Perform this transform after verifying that all other uses are
compatible with AGPR, and that an AGPR is available at all points
(which effectively means rewriting a full chain of MFMAs and
load/stores at once).
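
As a rough illustration (a hand-written before/after sketch, not part of the patch; the opcode and register class names are only representative of the gfx90a-style MFMAs in the tests), the non-tied case being handled is:

  ; Before: src2 and dst are distinct VGPR-class virtual registers, and the
  ; MFMA result is copied into an AV/AGPR-constrained register.
  %dst:vreg_512_align2 = V_MFMA_F32_32X32X8F16_vgprcd_e64 %a, %b, %src2, 0, 0, 0, implicit $mode, implicit $exec
  %av:av_512_align2 = COPY %dst

  ; After: %src2 and %dst are both moved to AGPR classes (so every other user,
  ; e.g. a load defining %src2 or stores reading %dst, must also tolerate
  ; AGPR), the MFMA is mutated to its AGPR-dst/src2 opcode, and the now
  ; redundant COPY is left for VirtRegRewriter to clean up.
  %dst:areg_512_align2 = V_MFMA_F32_32X32X8F16_e64 %a, %b, %src2, 0, 0, 0, implicit $mode, implicit $exec
  %av:av_512_align2 = COPY %dst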

@llvmbot (Member) commented Aug 11, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

If src2 and dst aren't the same register, to fold a copy
to AGPR into the instruction we also need to reassign src2
to an available AGPR. All the other uses of src2 also need
to be compatible with the AGPR replacement in order to avoid
inserting other copies somewhere else.

Perform this transform after verifying that all other uses are
compatible with AGPR, and that an AGPR is available at all points
(which effectively means rewriting a full chain of MFMAs and
load/stores at once).


Patch is 147.87 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153015.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp (+206-119)
  • (modified) llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir (+27-35)
  • (added) llvm/test/CodeGen/AMDGPU/jeff_illegal_reduced.ll (+541)
  • (modified) llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr.ll (+140-14)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp b/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
index bab83483f3de1..8c6aee89b7375 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
@@ -14,12 +14,7 @@
 /// MFMA opcode.
 ///
 /// TODO:
-///  - Handle non-tied dst+src2 cases. We need to try to find a copy from an
-///    AGPR from src2, or reassign src2 to an available AGPR (which should work
-///    in the common case of a load).
-///
-///  - Handle multiple MFMA uses of the same register. e.g. chained MFMAs that
-///    can be rewritten as a set
+///  - Handle SplitKit partial copy bundles, and not just full copy instructions
 ///
 ///  - Update LiveIntervals incrementally instead of recomputing from scratch
 ///
@@ -49,66 +44,149 @@ class AMDGPURewriteAGPRCopyMFMAImpl {
   VirtRegMap &VRM;
   LiveRegMatrix &LRM;
   LiveIntervals &LIS;
+  const RegisterClassInfo &RegClassInfo;
+
+  bool attemptReassignmentsToAGPR(SmallSetVector<Register, 4> &InterferingRegs,
+                                  MCPhysReg PrefPhysReg) const;
 
 public:
   AMDGPURewriteAGPRCopyMFMAImpl(MachineFunction &MF, VirtRegMap &VRM,
-                                LiveRegMatrix &LRM, LiveIntervals &LIS)
+                                LiveRegMatrix &LRM, LiveIntervals &LIS,
+                                const RegisterClassInfo &RegClassInfo)
       : ST(MF.getSubtarget<GCNSubtarget>()), TII(*ST.getInstrInfo()),
         TRI(*ST.getRegisterInfo()), MRI(MF.getRegInfo()), VRM(VRM), LRM(LRM),
-        LIS(LIS) {}
-
-  // TODO: Remove this restriction
-  bool mfmaHasSameSrc2AndDstReg(const MachineInstr &MI) const {
-    const MachineOperand *Src2 = TII.getNamedOperand(MI, AMDGPU::OpName::src2);
-    const MachineOperand *Dst = TII.getNamedOperand(MI, AMDGPU::OpName::vdst);
-    return Src2->getReg() == Dst->getReg() &&
-           Src2->getSubReg() == Dst->getSubReg();
-  }
+        LIS(LIS), RegClassInfo(RegClassInfo) {}
 
   bool isRewriteCandidate(const MachineInstr &MI) const {
-    return TII.isMAI(MI) &&
-           AMDGPU::getMFMASrcCVDstAGPROp(MI.getOpcode()) != -1 &&
-           mfmaHasSameSrc2AndDstReg(MI);
+    return TII.isMAI(MI) && AMDGPU::getMFMASrcCVDstAGPROp(MI.getOpcode()) != -1;
   }
 
   /// Compute the register class constraints based on the uses of \p Reg,
   /// excluding MFMA uses from which can be rewritten to change the register
   /// class constraint. This should be nearly identical to
   /// MachineRegisterInfo::recomputeRegClass.
-  const TargetRegisterClass *
-  recomputeRegClassExceptRewritable(Register Reg,
-                                    const TargetRegisterClass *OldRC,
-                                    const TargetRegisterClass *NewRC) const;
+
+  /// \p RewriteCandidates will collect the set of MFMA instructions that need
+  /// to have the opcode mutated to perform the replacement.
+  ///
+  /// \p RewriteRegs will accumulate the set of register used by those MFMAs
+  /// that need to have the register classes adjusted.
+  const TargetRegisterClass *recomputeRegClassExceptRewritable(
+      Register Reg, const TargetRegisterClass *OldRC,
+      const TargetRegisterClass *NewRC,
+      SmallVectorImpl<MachineInstr *> &RewriteCandidates,
+      SmallSetVector<Register, 4> &RewriteRegs) const;
 
   bool run(MachineFunction &MF) const;
 };
 
 const TargetRegisterClass *
 AMDGPURewriteAGPRCopyMFMAImpl::recomputeRegClassExceptRewritable(
-    Register Reg, const TargetRegisterClass *OldRC,
-    const TargetRegisterClass *NewRC) const {
-
-  // Accumulate constraints from all uses.
-  for (MachineOperand &MO : MRI.reg_nodbg_operands(Reg)) {
-    // Apply the effect of the given operand to NewRC.
-    MachineInstr *MI = MO.getParent();
-
-    // We can swap the classes of dst + src2 as a pair to AGPR, so ignore the
-    // effects of rewrite candidates. It just so happens that we can use either
-    // AGPR or VGPR in src0/src1, so don't bother checking the constraint
-    // effects of the individual operands.
-    if (isRewriteCandidate(*MI))
-      continue;
+    Register StartReg, const TargetRegisterClass *OldRC,
+    const TargetRegisterClass *NewRC,
+    SmallVectorImpl<MachineInstr *> &RewriteCandidates,
+    SmallSetVector<Register, 4> &RewriteRegs) const {
+  SmallVector<Register, 8> Worklist = {StartReg};
+
+  // Recursively visit all transitive MFMA users
+  while (!Worklist.empty()) {
+    Register Reg = Worklist.pop_back_val();
+    // Accumulate constraints from all uses.
+    for (MachineOperand &MO : MRI.reg_nodbg_operands(Reg)) {
+      // Apply the effect of the given operand to NewRC.
+      MachineInstr *MI = MO.getParent();
+
+      // We can swap the classes of dst + src2 as a pair to AGPR, so ignore the
+      // effects of rewrite candidates. It just so happens that we can use
+      // either AGPR or VGPR in src0/src1, so don't bother checking the
+      // constraint effects of the individual operands.
+      if (isRewriteCandidate(*MI)) {
+        for (AMDGPU::OpName OpName :
+             {AMDGPU::OpName::vdst, AMDGPU::OpName::src2}) {
+          const MachineOperand *Op = TII.getNamedOperand(*MI, OpName);
+          if (!Op->isReg())
+            continue;
+
+          Register OtherReg = Op->getReg();
+          if (OtherReg != Reg) {
+            if (RewriteRegs.insert(OtherReg))
+              Worklist.push_back(OtherReg);
+          }
+        }
+
+        LLVM_DEBUG(dbgs() << "Ignoring effects of " << *MI);
+
+        if (!is_contained(RewriteCandidates, MI))
+          RewriteCandidates.push_back(MI);
+
+        continue;
+      }
 
-    unsigned OpNo = &MO - &MI->getOperand(0);
-    NewRC = MI->getRegClassConstraintEffect(OpNo, NewRC, &TII, &TRI);
-    if (!NewRC || NewRC == OldRC)
-      return nullptr;
+      unsigned OpNo = &MO - &MI->getOperand(0);
+      NewRC = MI->getRegClassConstraintEffect(OpNo, NewRC, &TII, &TRI);
+      if (!NewRC || NewRC == OldRC) {
+        LLVM_DEBUG(dbgs() << "User of " << printReg(Reg, &TRI)
+                          << " cannot be reassigned to AGPR: " << *MI);
+        return nullptr;
+      }
+    }
   }
 
   return NewRC;
 }
 
+/// Attempt to reassign the registers in \p InterferingRegs to be AGPRs, with a
+/// preference to use \p PhysReg first. Returns false if the reassignments
+/// cannot be trivially performed.
+bool AMDGPURewriteAGPRCopyMFMAImpl::attemptReassignmentsToAGPR(
+    SmallSetVector<Register, 4> &InterferingRegs, MCPhysReg PrefPhysReg) const {
+  // FIXME: The ordering may matter here, but we're just taking uselistorder
+  // with the special case of ensuring to process the starting instruction
+  // first. We probably should extract the priority advisor out of greedy and
+  // use that ordering.
+  for (Register InterferingReg : InterferingRegs) {
+    LiveInterval &ReassignLI = LIS.getInterval(InterferingReg);
+    const TargetRegisterClass *EquivalentAGPRRegClass =
+        TRI.getEquivalentAGPRClass(MRI.getRegClass(InterferingReg));
+
+    MCPhysReg Assignable = AMDGPU::NoRegister;
+    if (EquivalentAGPRRegClass->contains(PrefPhysReg) &&
+        LRM.checkInterference(ReassignLI, PrefPhysReg) ==
+            LiveRegMatrix::IK_Free) {
+      // First try to assign to the AGPR we were already copying to. This
+      // should be the first assignment we attempt. We have to guard
+      // against the use being a subregister (which doesn't have an exact
+      // class match).
+
+      // TODO: If this does happen to be a subregister use, we should
+      // still try to assign to a subregister of the original copy result.
+      Assignable = PrefPhysReg;
+    } else {
+      ArrayRef<MCPhysReg> AllocOrder =
+          RegClassInfo.getOrder(EquivalentAGPRRegClass);
+      for (MCPhysReg Reg : AllocOrder) {
+        if (LRM.checkInterference(ReassignLI, Reg) == LiveRegMatrix::IK_Free) {
+          Assignable = Reg;
+          break;
+        }
+      }
+    }
+
+    if (!Assignable) {
+      LLVM_DEBUG(dbgs() << "Unable to reassign VGPR "
+                        << printReg(InterferingReg, &TRI)
+                        << " to a free AGPR\n");
+      return false;
+    }
+
+    LLVM_DEBUG(dbgs() << "Reassigning VGPR " << printReg(InterferingReg, &TRI)
+                      << " to " << printReg(Assignable, &TRI) << '\n');
+    LRM.assign(ReassignLI, Assignable);
+  }
+
+  return true;
+}
+
 bool AMDGPURewriteAGPRCopyMFMAImpl::run(MachineFunction &MF) const {
   // This only applies on subtargets that have a configurable AGPR vs. VGPR
   // allocation.
@@ -145,7 +223,6 @@ bool AMDGPURewriteAGPRCopyMFMAImpl::run(MachineFunction &MF) const {
 
     LiveInterval &LI = LIS.getInterval(VReg);
 
-    // TODO: Test multiple uses
     for (VNInfo *VNI : LI.vnis()) {
       MachineInstr *DefMI = LIS.getInstructionFromIndex(VNI->def);
 
@@ -154,55 +231,50 @@ bool AMDGPURewriteAGPRCopyMFMAImpl::run(MachineFunction &MF) const {
       if (!DefMI || !DefMI->isFullCopy())
         continue;
 
-      Register CopySrcReg = DefMI->getOperand(1).getReg();
-      if (!CopySrcReg.isVirtual())
+      Register MFMADstReg = DefMI->getOperand(1).getReg();
+      if (!MFMADstReg.isVirtual())
         continue;
 
-      LiveInterval &CopySrcLI = LIS.getInterval(CopySrcReg);
+      LiveInterval &CopySrcLI = LIS.getInterval(MFMADstReg);
       LiveQueryResult LRQ = CopySrcLI.Query(VNI->def.getRegSlot());
-      MachineInstr *CopySrcMI = LIS.getInstructionFromIndex(LRQ.valueIn()->def);
-      if (!CopySrcMI)
+      MachineInstr *MFMA = LIS.getInstructionFromIndex(LRQ.valueIn()->def);
+      if (!MFMA || !isRewriteCandidate(*MFMA))
         continue;
 
-      int AGPROp = AMDGPU::getMFMASrcCVDstAGPROp(CopySrcMI->getOpcode());
-      if (AGPROp == -1)
+      MachineOperand *Src2 = TII.getNamedOperand(*MFMA, AMDGPU::OpName::src2);
+      if (!Src2->isReg())
         continue;
 
-      MachineOperand *Src2 =
-          TII.getNamedOperand(*CopySrcMI, AMDGPU::OpName::src2);
+      Register Src2Reg = Src2->getReg();
+      if (!Src2Reg.isVirtual())
+        continue;
 
       // FIXME: getMinimalPhysRegClass returns a nonsense AV_* subclass instead
       // of an AGPR or VGPR subclass, so we can't simply use the result on the
       // assignment.
 
       LLVM_DEBUG({
-        Register Src2PhysReg = VRM.getPhys(Src2->getReg());
         dbgs() << "Attempting to replace VGPR MFMA with AGPR version:"
                << " Dst=[" << printReg(VReg) << " => "
-               << printReg(PhysReg, &TRI) << "], Src2=["
-               << printReg(Src2->getReg(), &TRI) << " => "
-               << printReg(Src2PhysReg, &TRI) << "]: " << *CopySrcMI;
+               << printReg(PhysReg, &TRI) << ']';
+
+        if (Src2Reg) {
+          Register Src2PhysReg = VRM.getPhys(Src2Reg);
+          dbgs() << ", Src2=[" << printReg(Src2Reg, &TRI) << " => "
+                 << printReg(Src2PhysReg, &TRI) << "]: " << *MFMA;
+        }
       });
 
-      // If the inputs are tied and the same register, we can shortcut and
-      // directly replace the register.
-      if (!Src2->isReg() || Src2->getReg() != CopySrcReg ||
-          Src2->getSubReg() != DefMI->getOperand(1).getSubReg()) {
-        LLVM_DEBUG(
-            dbgs()
-            << "Replacing untied VGPR MFMAs with AGPR form not yet handled\n");
-        // TODO: Only handles the tied case for now. If the input operand is a
-        // different register, we need to also reassign it (either by looking
-        // for a compatible copy-from-AGPR, or by seeing if an available AGPR is
-        // compatible with all other uses.
-
-        // If we can't reassign it, we'd need to introduce a different copy
-        // which is likely worse than the copy we'd be saving.
-        continue;
-      }
+      const TargetRegisterClass *DstVirtRegRC = MRI.getRegClass(MFMADstReg);
 
-      const TargetRegisterClass *Src2VirtRegRC =
-          MRI.getRegClass(Src2->getReg());
+      // src2 and dst have the same physical class constraint; try to preserve
+      // the original src2 subclass if one were to exist.
+      SmallVector<MachineInstr *, 4> RewriteCandidates = {MFMA};
+      SmallSetVector<Register, 4> RewriteRegs;
+
+      // Make sure we reassign the MFMA we found the copy from first. We want
+      // to ensure dst ends up in the physreg we were originally copying to.
+      RewriteRegs.insert(MFMADstReg);
 
       // We've found av = COPY (MFMA), and need to verify that we can trivially
       // rewrite src2 to use the new AGPR. If we can't trivially replace it,
@@ -210,61 +282,71 @@ bool AMDGPURewriteAGPRCopyMFMAImpl::run(MachineFunction &MF) const {
       // first place, as well as need to assign another register, and need to
       // figure out where to put them. The live range splitting is smarter than
       // anything we're doing here, so trust it did something reasonable.
-      const TargetRegisterClass *Src2ExceptRC =
-          recomputeRegClassExceptRewritable(Src2->getReg(), Src2VirtRegRC,
-                                            VirtRegRC);
-      if (!Src2ExceptRC) {
-        LLVM_DEBUG(dbgs() << "Could not recompute the regclass\n");
+      //
+      // Note recomputeRegClassExceptRewritable will consider the constraints of
+      // this MFMA's src2 as well as the src2/dst of any transitive MFMA users.
+      const TargetRegisterClass *DstExceptRC =
+          recomputeRegClassExceptRewritable(MFMADstReg, DstVirtRegRC, VirtRegRC,
+                                            RewriteCandidates, RewriteRegs);
+      if (!DstExceptRC) {
+        LLVM_DEBUG(dbgs() << "Could not recompute the regclass of "
+                          << printReg(MFMADstReg, &TRI) << '\n');
         continue;
       }
 
-      const TargetRegisterClass *NewSrc2ConstraintRC =
-          TII.getRegClass(TII.get(AGPROp), Src2->getOperandNo(), &TRI, MF);
-
-      // Try to constrain src2 to the replacement instruction candidate's
-      // register class.
-      const TargetRegisterClass *NewSrc2RC =
-          TRI.getCommonSubClass(Src2ExceptRC, NewSrc2ConstraintRC);
-      if (!NewSrc2RC) {
-        LLVM_DEBUG(dbgs() << "Other uses of " << printReg(Src2->getReg(), &TRI)
-                          << " are incompatible with replacement class\n");
-        continue;
+      // If src2 and dst are different registers, we need to also reassign the
+      // input to an available AGPR if it is compatible with all other uses.
+      //
+      // If we can't reassign it, we'd need to introduce a different copy
+      // which is likely worse than the copy we'd be saving.
+      //
+      // It's likely that the MFMA is used in sequence with other MFMAs; if we
+      // cannot migrate the full use/def chain of MFMAs, we would need to
+      // introduce intermediate copies somewhere. So we only make the
+      // transform if all the interfering MFMAs can also be migrated. Collect
+      // the set of rewritable MFMAs and check if we can assign an AGPR at
+      // that point.
+      //
+      // If any of the MFMAs aren't reassignable, we give up and rollback to
+      // the original register assignments.
+
+      using RecoloringStack =
+          SmallVector<std::pair<const LiveInterval *, MCRegister>, 8>;
+      RecoloringStack TentativeReassignments;
+
+      for (Register RewriteReg : RewriteRegs) {
+        LiveInterval &LI = LIS.getInterval(RewriteReg);
+        TentativeReassignments.push_back({&LI, VRM.getPhys(RewriteReg)});
+        LRM.unassign(LI);
       }
 
-      MRI.setRegClass(VReg, AssignedRC);
-      MRI.setRegClass(Src2->getReg(), NewSrc2RC);
-
-      CopySrcMI->setDesc(TII.get(AGPROp));
-
-      // Perform replacement of the register, rewriting the rewritable uses.
-      for (MachineInstr &UseMI :
-           make_early_inc_range(MRI.reg_instructions(CopySrcReg))) {
-        if (TII.isMAI(UseMI)) {
-          // Note the register we need to rewrite may still appear in src0/src1,
-          // but that's fine since those can use A or V anyway.
-          int ReplacementOp = AMDGPU::getMFMASrcCVDstAGPROp(UseMI.getOpcode());
-          if (ReplacementOp != -1)
-            UseMI.setDesc(TII.get(ReplacementOp));
+      if (!attemptReassignmentsToAGPR(RewriteRegs, PhysReg)) {
+        // Roll back the register assignments to the original state.
+        for (auto [LI, OldAssign] : TentativeReassignments) {
+          if (VRM.hasPhys(LI->reg()))
+            LRM.unassign(*LI);
+          LRM.assign(*LI, OldAssign);
         }
 
-        UseMI.substituteRegister(CopySrcReg, VReg, AMDGPU::NoSubRegister, TRI);
+        continue;
       }
 
-      LLVM_DEBUG(dbgs() << "Replaced VGPR MFMA with AGPR: " << *CopySrcMI);
-
-      // We left behind an identity copy, so delete it.
-      LIS.RemoveMachineInstrFromMaps(*DefMI);
-      DefMI->eraseFromParent();
-
-      LRM.unassign(CopySrcLI);
+      // Fixup the register classes of the virtual registers now that we've
+      // committed to the reassignments.
+      for (Register InterferingReg : RewriteRegs) {
+        const TargetRegisterClass *EquivalentAGPRRegClass =
+            TRI.getEquivalentAGPRClass(MRI.getRegClass(InterferingReg));
+        MRI.setRegClass(InterferingReg, EquivalentAGPRRegClass);
+      }
 
-      // We don't need the liveness information anymore, so don't bother
-      // updating the intervals. Just delete the stale information.
-      // TODO: Is it worth preserving these?
-      LIS.removeInterval(CopySrcReg);
-      LIS.removeInterval(VReg);
-      LIS.createAndComputeVirtRegInterval(VReg);
+      for (MachineInstr *RewriteCandidate : RewriteCandidates) {
+        int NewMFMAOp =
+            AMDGPU::getMFMASrcCVDstAGPROp(RewriteCandidate->getOpcode());
+        RewriteCandidate->setDesc(TII.get(NewMFMAOp));
+      }
 
+      // We likely left an identity copy behind after assignment; let
+      // VirtRegRewriter deal with it later.
       MadeChange = true;
     }
   }
@@ -275,6 +357,7 @@ bool AMDGPURewriteAGPRCopyMFMAImpl::run(MachineFunction &MF) const {
 class AMDGPURewriteAGPRCopyMFMALegacy : public MachineFunctionPass {
 public:
   static char ID;
+  RegisterClassInfo RegClassInfo;
 
   AMDGPURewriteAGPRCopyMFMALegacy() : MachineFunctionPass(ID) {
     initializeAMDGPURewriteAGPRCopyMFMALegacyPass(
@@ -320,11 +403,13 @@ bool AMDGPURewriteAGPRCopyMFMALegacy::runOnMachineFunction(
   if (skipFunction(MF.getFunction()))
     return false;
 
+  RegClassInfo.runOnMachineFunction(MF);
+
   auto &VRM = getAnalysis<VirtRegMapWrapperLegacy>().getVRM();
   auto &LRM = getAnalysis<LiveRegMatrixWrapperLegacy>().getLRM();
   auto &LIS = getAnalysis<LiveIntervalsWrapperPass>().getLIS();
 
-  AMDGPURewriteAGPRCopyMFMAImpl Impl(MF, VRM, LRM, LIS);
+  AMDGPURewriteAGPRCopyMFMAImpl Impl(MF, VRM, LRM, LIS, RegClassInfo);
   return Impl.run(MF);
 }
 
@@ -334,8 +419,10 @@ AMDGPURewriteAGPRCopyMFMAPass::run(MachineFunction &MF,
   VirtRegMap &VRM = MFAM.getResult<VirtRegMapAnalysis>(MF);
   LiveRegMatrix &LRM = MFAM.getResult<LiveRegMatrixAnalysis>(MF);
   LiveIntervals &LIS = MFAM.getResult<LiveIntervalsAnalysis>(MF);
+  RegisterClassInfo RegClassInfo;
+  RegClassInfo.runOnMachineFunction(MF);
 
-  AMDGPURewriteAGPRCopyMFMAImpl Impl(MF, VRM, LRM, LIS);
+  AMDGPURewriteAGPRCopyMFMAImpl Impl(MF, VRM, LRM, LIS, RegClassInfo);
   if (!Impl.run(MF))
     return PreservedAnalyses::all();
   auto PA = getMachineFunctionPassPreservedAnalyses();
diff --git a/llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir b/llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir
index f8848717808cb..dcb45cc90ca68 100644
--- a/llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir
+++ b/llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir
@@ -209,15 +209,14 @@ body:             |
   ; CHECK-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
   ; CHECK-NEXT:   liveins: $vcc, $vgpr2_vgpr3
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   renamable $vgpr0_vgpr1 = GLOBAL_LOAD_DWORDX2 undef renamable $vgpr0_vgpr1, 0, 0, implicit $exec :: (load (s64), addrspace 1)
-  ; CHECK-NEXT:   early-clobber renamable $vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15_vgpr16_vgpr17_vgpr18_vgpr19 = V_MFMA_F32_32X32X8F16_vgprcd_e64 $vgpr2_vgpr3, $vgpr2_vgpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr1...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/amdgpu/handle-untied-agpr-mfma-rewrite-2 branch from 79409db to 228b088 on August 12, 2025 01:30
          int ReplacementOp = AMDGPU::getMFMASrcCVDstAGPROp(UseMI.getOpcode());
          if (ReplacementOp != -1)
            UseMI.setDesc(TII.get(ReplacementOp));
      if (!attemptReassignmentsToAGPR(RewriteRegs, PhysReg)) {
Contributor:

In the case that other MFMAs in the stack have copy to AGPR, should those receive preference towards their specific dest AGPR?

arsenm (Author):

Probably, but I'm not sure the hinting is doing anything useful right now, other than as a compile-time hint to avoid trying the whole allocation order.

      MachineInstr *CopySrcMI = LIS.getInstructionFromIndex(LRQ.valueIn()->def);
      if (!CopySrcMI)
      MachineInstr *MFMA = LIS.getInstructionFromIndex(LRQ.valueIn()->def);
      if (!MFMA || !isRewriteCandidate(*MFMA))
Contributor:

Should this enable rewriting for cases like these?


---
name:            copy_multiple_def
tracksRegLiveness: true
machineFunctionInfo:
  isEntryFunction: true
  stackPtrOffsetReg: '$sgpr32'
  occupancy:       10
  sgprForEXECCopy: '$sgpr100_sgpr101'
body:             |
  bb.0:
    S_NOP 0, implicit-def $agpr0
    renamable $sgpr0 = S_MOV_B32 0
    undef %0.sub8:vreg_512_align2 = V_MOV_B32_e32 0, implicit $exec
    renamable $sgpr1 = COPY renamable $sgpr0
    %1:vreg_64_align2 = COPY killed renamable $sgpr0_sgpr1
    renamable $vcc = S_AND_B64 $exec, -1, implicit-def dead $scc
    %0.sub9:vreg_512_align2 = COPY %0.sub8
    undef %0.sub0_sub1:vreg_512_align2 = GLOBAL_LOAD_DWORDX2 undef %3:vreg_64_align2, 12, 0, implicit $exec :: (load (s64), addrspace 1)
    S_CBRANCH_VCCNZ %bb.1, implicit $vcc
    S_BRANCH %bb.2

  bb.1:
    liveins: $vcc

    undef %0.sub0_sub1:vreg_512_align2 = GLOBAL_LOAD_DWORDX2 undef %3:vreg_64_align2, 0, 0, implicit $exec :: (load (s64), addrspace 1)
    %0:vreg_512_align2 = V_MFMA_F32_32X32X8F16_mac_vgprcd_e64 %1, %1, %0, 0, 0, 0, implicit $mode, implicit $exec
    S_CBRANCH_VCCNZ %bb.1, implicit $vcc
    S_BRANCH %bb.2

  bb.2:
    ; No VGPRs available for %0
    S_NOP 0, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7
    S_NOP 0, implicit-def $vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15
    S_NOP 0, implicit-def $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23
    S_NOP 0, implicit-def $vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31
    S_NOP 0, implicit-def $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39
    S_NOP 0, implicit-def $vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
    S_NOP 0, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51_vgpr52_vgpr53_vgpr54_vgpr55
    S_NOP 0, implicit-def $vgpr56_vgpr57_vgpr58_vgpr59_vgpr60_vgpr61_vgpr62_vgpr63
    %2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
    GLOBAL_STORE_DWORDX4_SADDR %2, %0.sub8_sub9_sub10_sub11, undef $sgpr0_sgpr1, 32, 0, implicit $exec :: (store (s128), align 32, addrspace 1)
    GLOBAL_STORE_DWORDX4_SADDR %2, %0.sub12_sub13_sub14_sub15, undef $sgpr0_sgpr1, 48, 0, implicit $exec :: (store (s128), addrspace 1)
    GLOBAL_STORE_DWORDX4_SADDR %2, %0.sub0_sub1_sub2_sub3, undef $sgpr0_sgpr1, 0, 0, implicit $exec :: (store (s128), align 128, addrspace 1)
    GLOBAL_STORE_DWORDX4_SADDR %2, %0.sub4_sub5_sub6_sub7, killed undef $sgpr0_sgpr1, 16, 0, implicit $exec :: (store (s128), addrspace 1)
    S_ENDPGM 0

...

arsenm (Author):

I don't see this case fold in this patch, but it does at the top of the stack

If src2 and dst aren't the same register, to fold a copy
to AGPR into the instruction we also need to reassign src2
to an available AGPR. All the other uses of src2 also need
to be compatible with the AGPR replacement in order to avoid
inserting other copies somewhere else.

Perform this transform after verifying that all other uses are
compatible with AGPR, and that an AGPR is available at all points
(which effectively means rewriting a full chain of MFMAs and
load/stores at once).
@arsenm (Author) commented Aug 19, 2025

ping

@jrbyrnes (Contributor) left a comment

LGTM

@arsenm arsenm merged commit ff5f396 into main Aug 20, 2025
9 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/handle-untied-agpr-mfma-rewrite-2 branch August 20, 2025 23:16