🐛 Describe the bug

With reduction="none", LigerFusedLinearCrossEntropyLoss returns incorrect gradients when the per-token losses are multiplied by a custom weight_mask before reduction, i.e. loss = sum_i(mask_i * ce_i). The reproduction script is below.

Reproduce
import torch
from torch.nn import CrossEntropyLoss
from liger_kernel.transformers import LigerCrossEntropyLoss
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss
# B, T, H, V = 2, 2048, 256, 32000
# B, T, H, V = 2, 1, 10, 15
B, T, H, V = 2, 4, 10, 15
ignore_index = -100
reduction = "none"
device = "cuda"
dtype = torch.float32
scalar = 2
atol, rtol = 1e-8, 1e-5
use_ignore_index = False
if use_ignore_index:
    # Passed
    target_ce = LigerCrossEntropyLoss(ignore_index=ignore_index, reduction=reduction)
    target_flce = LigerFusedLinearCrossEntropyLoss(ignore_index=ignore_index, reduction=reduction)
    # torch_ce = CrossEntropyLoss(ignore_index=ignore_index, reduction=reduction)
else:
    # Also passed
    target_ce = LigerCrossEntropyLoss(reduction=reduction)
    target_flce = LigerFusedLinearCrossEntropyLoss(reduction=reduction)
_tensor = torch.randn(B * T, H, device=device, dtype=dtype) * scalar
lin_weight = torch.randn(V, H, device=device, dtype=dtype)

# Path 1: materialize the logits explicitly and apply LigerCrossEntropyLoss.
_input1 = _tensor.detach().clone().requires_grad_(True)
lin_weight1 = lin_weight.detach().clone().requires_grad_(True)
_input1_mul_weight = _input1 @ lin_weight1.transpose(0, 1)

# Path 2: pass hidden states and weight to LigerFusedLinearCrossEntropyLoss.
_input2 = _tensor.detach().clone().requires_grad_(True)
lin_weight2 = lin_weight.detach().clone().requires_grad_(True)

target = torch.randint(0, V, (B * T,), device=device, dtype=torch.long)
# Optionally mark a random subset of positions as ignore_index:
# num_elements_to_assign = torch.randint(1, B * T // 2, (1,)).item()
# indices_to_assign = torch.randperm(B * T)[:num_elements_to_assign]
# target[indices_to_assign] = ignore_index
# target[:B * T // 2] = ignore_index
target[:1] = ignore_index
if use_ignore_index:
    output = target_ce(_input1_mul_weight, target)
    output2 = target_flce(lin_weight2, _input2, target)
    mask = target != ignore_index
    loss1 = (output * mask).sum() / mask.sum()
    loss2 = (output2 * mask).sum() / mask.sum()
else:
    output = target_ce(_input1_mul_weight, target)
    output2 = target_flce(lin_weight2, _input2, target)
    mask = (target != ignore_index).type_as(output2)
    # mask = torch.randn(B * T, device=device, dtype=dtype)
    print(f"weight_mask: {mask}")
    loss1 = (output * mask).sum()
    loss2 = (output2 * mask).sum()
# loss1 = output
# loss2 = output2 * output2
print(f"loss1: {loss1}")
print(f"loss2: {loss2}")
# Both losses are already scalars after .sum(), so a plain backward() is enough.
loss1.backward()
loss2.backward()
print(f"grad1_sum: {torch.sum(torch.abs(_input1.grad))}")
print(f"grad2_sum: {torch.sum(torch.abs(_input2.grad))}")
# assert torch.allclose(loss1, loss2, atol=atol, rtol=rtol)
# assert torch.allclose(_input1.grad, _input2.grad, atol=atol, rtol=rtol)
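CrossEntropyLoss is imported above but never used, so for completeness here is a minimal pure-PyTorch baseline under the same setup (the _input3 / lin_weight3 names are mine, added for illustration). Since loss = sum_i(mask_i * ce_i), the upstream gradient reaching token i's per-token loss is exactly mask_i, so masked-out tokens should receive zero gradient:

# Reference path: plain torch.nn.CrossEntropyLoss on explicit logits,
# reusing _tensor, lin_weight, target, and mask from the script above.
_input3 = _tensor.detach().clone().requires_grad_(True)
lin_weight3 = lin_weight.detach().clone().requires_grad_(True)
torch_ce = CrossEntropyLoss(reduction="none")
output3 = torch_ce(_input3 @ lin_weight3.transpose(0, 1), target)
loss3 = (output3 * mask).sum()
loss3.backward()
print(f"grad3_sum: {torch.sum(torch.abs(_input3.grad))}")
# Expected: grad1_sum and grad2_sum both match grad3_sum; per the report,
# the fused path (grad2_sum) deviates.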
Versions
Python version: 3.11.11
Liger Kernel version: 0.6.4
PyTorch version: 2.6.0+cu126
CUDA version: 12.6
Triton version: 3.2.0
Transformers version: 4.56.0