
Conversation

@686f6c61 commented on Jan 9, 2026

Fixes #43116

Summary

This PR fixes multi-label classification bugs in run_classification.py and adds an optional confidence scores output, following community feedback.

Bug fixes

Fixed 4 bugs that broke multi-label classification with JSON datasets (a short sketch follows the list):

  1. Missing imports for SequenceFeature and expit
  2. Regression detection failed with an AttributeError on JSON datasets - fixed by adding a hasattr() check before accessing dtype
  3. Multi-label detection missed JSON-based datasets - fixed by adding an isinstance() check for SequenceFeature
  4. Predictions were missing the sigmoid activation - fixed by applying expit() before thresholding in both compute_metrics and do_predict
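
For context, a minimal sketch of what the four fixes boil down to. It is illustrative rather than the literal diff: I'm assuming the SequenceFeature import refers to datasets.Sequence, and label_feature/logits are stand-ins for the values handled in compute_metrics and do_predict.

import numpy as np
from datasets import Sequence, Value   # fix 1: feature type import (what the script calls SequenceFeature)
from scipy.special import expit        # fix 1: sigmoid applied before thresholding

# How a JSON list-of-ints label column is typed by the datasets library.
label_feature = Sequence(Value("int64"))

# Fix 2: guard the dtype access so list-typed label columns cannot raise AttributeError.
is_regression = hasattr(label_feature, "dtype") and label_feature.dtype in ("float32", "float64")

# Fix 3: detect multi-label datasets by feature type instead of by dtype.
is_multi_label = isinstance(label_feature, Sequence)

# Fix 4: pass logits through a sigmoid before thresholding; thresholding raw
# logits at 0.5 is what produced (near-)empty predictions.
logits = np.array([[2.1, -0.3], [-1.5, 0.8]])
binary_preds = (expit(logits) > 0.5).astype(int)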

New features

Based on feedback from @ziorufus, added configurable threshold and confidence scores output.

New parameters (see the sketch after the list):

  • --output_confidence_scores (bool, default: False) - Output JSON with confidence scores instead of binary predictions
  • --multi_label_threshold (float, default: 0.5) - Threshold for converting probabilities to binary predictions
  • --top_k_labels (int, optional) - Limit output to top K most confident labels
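
A sketch of how these could be declared, in the dataclass-plus-HfArgumentParser style the example scripts use (the class name MultiLabelOutputArguments is only for this sketch; in the script the fields would sit on the existing argument dataclass):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MultiLabelOutputArguments:
    output_confidence_scores: bool = field(
        default=False,
        metadata={"help": "Write JSON with per-label confidence scores instead of binary predictions."},
    )
    multi_label_threshold: float = field(
        default=0.5,
        metadata={"help": "Probability threshold for converting sigmoid scores to binary predictions."},
    )
    top_k_labels: Optional[int] = field(
        default=None,
        metadata={"help": "If set, keep only the top K most confident labels per example."},
    )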

Output format follows the transformers Pipeline API convention:

Traditional mode (default):

index   prediction
0       ['positive', 'urgent']
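
A sketch of where each TSV row comes from once sigmoid and threshold have been applied (label_list and binary_preds are placeholder values, not the script's exact variable names):

label_list = ["positive", "urgent", "spam"]
binary_preds = [[1, 1, 0]]  # one example after sigmoid + threshold

for index, row in enumerate(binary_preds):
    item = [label_list[i] for i, flag in enumerate(row) if flag]
    print(f"{index}\t{item}")  # -> 0    ['positive', 'urgent']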

Confidence scores mode:

[
  {
    "index": 0,
    "predictions": [
      {"label": "positive", "score": 0.89},
      {"label": "urgent", "score": 0.67}
    ]
  }
]
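
A sketch of how the prediction step can assemble this structure from the sigmoid scores (label_list, logits, and the output file name are placeholders; the real script writes into --output_dir):

import json

import numpy as np
from scipy.special import expit

label_list = ["positive", "urgent", "spam"]
logits = np.array([[2.1, 0.7, -1.3]])  # one example, three labels
scores = expit(logits)

top_k = None  # --top_k_labels
records = []
for index, example_scores in enumerate(scores):
    # Sort labels by descending confidence, as the Pipeline API does.
    order = np.argsort(example_scores)[::-1]
    if top_k is not None:
        order = order[:top_k]
    records.append(
        {
            "index": index,
            "predictions": [{"label": label_list[i], "score": round(float(example_scores[i]), 4)} for i in order],
        }
    )

with open("predict_results.json", "w") as f:
    json.dump(records, f, indent=2)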

Backward compatibility

Default behavior unchanged. New features require explicit flags. No breaking changes.

Testing

Validated with:

  • Ruff linting and formatting (all checks passed)
  • Python syntax check
  • Logic validation with simulated multi-label data
  • Backward compatibility verification

Test scenarios:

  • Traditional TSV output with default threshold
  • JSON confidence scores output
  • Custom threshold values
  • Top-K filtering

Usage examples

Traditional mode:

python run_classification.py \
  --model_name_or_path bert-base-uncased \
  --test_file test.json \
  --do_predict \
  --output_dir ./output

Confidence scores:

python run_classification.py \
  --model_name_or_path bert-base-uncased \
  --test_file test.json \
  --do_predict \
  --output_confidence_scores \
  --output_dir ./output

Custom threshold:

python run_classification.py \
  --model_name_or_path bert-base-uncased \
  --test_file test.json \
  --do_predict \
  --multi_label_threshold 0.3 \
  --output_dir ./output

Top-K labels:

python run_classification.py \
  --model_name_or_path bert-base-uncased \
  --test_file test.json \
  --do_predict \
  --output_confidence_scores \
  --top_k_labels 3 \
  --output_dir ./output

Implementation notes

  • Follows Pipeline API format used by text-classification and zero-shot pipelines
  • Scores sorted descending by confidence
  • JSON output enables downstream processing and custom threshold application (see the sketch after this list)
  • Type hints and documentation complete
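
For instance, a downstream consumer can re-threshold the saved scores without rerunning prediction (a minimal sketch; predict_results.json is a placeholder name for the JSON file written above):

import json

THRESHOLD = 0.3  # stricter or looser than the 0.5 used at prediction time

with open("predict_results.json") as f:
    records = json.load(f)

for record in records:
    kept = [p["label"] for p in record["predictions"] if p["score"] >= THRESHOLD]
    print(record["index"], kept)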

Changes

  • examples/pytorch/text-classification/run_classification.py (+78, -20)
  • 3 new parameters added
  • 1 import added
  • 58 net lines added

Fixes four bugs that prevented multi-label classification from working
with JSON data files:

1. AttributeError when detecting regression (line 416)
2. AttributeError in regression type casting (line 432)
3. AttributeError when detecting multi-label (line 442)
4. Empty predictions due to missing sigmoid (lines 651, 718)

Changes:
- Add hasattr() checks before accessing dtype attribute
- Use isinstance() for proper multi-label detection
- Apply sigmoid before thresholding in predictions

All changes are backwards compatible and tested with single-label,
multi-label, and regression tasks.

Fixes huggingface#43116

Fix formatting issues detected by CircleCI check_code_quality.

Implement configurable threshold and confidence scores output following
transformers Pipeline API conventions:

- Add --output_confidence_scores flag (default: False for backward compatibility)
- Add --multi_label_threshold parameter (default: 0.5, configurable)
- Add --top_k_labels parameter to limit output to top K labels
- Output JSON format with {"label": str, "score": float} when enabled
- Maintain backward compatible TSV format when disabled

This addresses feedback from issue huggingface#43116 to provide more flexibility
for multi-label classification workflows.
@686f6c61 (Author) commented:

Updated this PR to include confidence scores output based on @ziorufus's feedback.

What changed

Added three new parameters for multi-label classification:

  • --output_confidence_scores - Output JSON with scores instead of binary 0/1 (default: False for backward compatibility)
  • --multi_label_threshold - Configurable threshold for binary predictions (default: 0.5, was hardcoded before)
  • --top_k_labels - Limit to top K most confident labels (optional)

Implementation

Following the transformers Pipeline API convention, the output format is:

[
  {
    "index": 0,
    "predictions": [
      {"label": "positive", "score": 0.89},
      {"label": "urgent", "score": 0.67}
    ]
  }
]

This matches text-classification and zero-shot pipelines for consistency.
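
For reference, this is the shape the text-classification pipeline already returns when asked for all scores (a sketch; the model name is only an example):

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example model
    top_k=None,  # return scores for every label, not just the best one
)
result = clf("This is urgent and positive")
# result contains {"label": ..., "score": ...} dicts sorted by descending score,
# the same shape this PR writes per example.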

Why I think this could be useful

@ziorufus suggested that outputting raw scores could give users more flexibility to:

  • Apply custom thresholds post-prediction
  • See model confidence per label
  • Integrate with downstream systems more easily

I implemented this approach, but I'm open to feedback and suggestions if there is a better way to handle it.

Default behavior unchanged - traditional TSV output with threshold=0.5. New features are opt-in.

Testing

All code quality checks passed:

  • Ruff linting
  • Ruff formatting
  • Logic validation
  • Backward compatibility verified


Development

Successfully merging this pull request may close these issues.

Multi-label classification always returns empty results in run_classification.py example script
