Request: Separate thresholds for valid topics and invalid topics.

As of writing, there's a single threshold for the zero-shot topics, used as the cutoff for whether a topic counts as 'found' or not. Having separate thresholds for valid topics and invalid topics would let us perform more nuanced filtering, like: "It _might_ not be about sports, but it's definitely not about travel."

Consider the case where our threshold is 0.5, the default. If we assume the per-topic false-positive rate here is 4%[1], and that topics misfire independently of one another, then adding ten negative topics puts our odds of accidentally flagging something at 1 - (1 - 0.04)^10, or about 33.5%.

It would be nice to be able to tune that.
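To put numbers on that, here's a quick sketch of how the combined rate moves as the per-topic rate drops. The function name is made up for illustration, the 4% figure is the same unsourced assumption as above, and the independence assumption is mine.

```python
def combined_false_positive_rate(per_topic_rate: float, n_topics: int) -> float:
    """Chance that at least one of n_topics checks fires falsely.

    Assumes every topic has the same false-positive rate and that
    the topics misfire independently of each other.
    """
    return 1.0 - (1.0 - per_topic_rate) ** n_topics


# Compare a few hypothetical per-topic rates across ten negative topics.
for rate in (0.04, 0.02, 0.01):
    combined = combined_false_positive_rate(rate, 10)
    print(f"{rate:.0%} per topic, 10 topics -> {combined:.1%} combined")
```

A stricter threshold on the invalid side should lower the per-topic rate, which is the knob this request is asking for.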

I imagine the change would be something akin to:
```
        valid_topics = model_input["valid_topics"]
        invalid_topics = model_input["invalid_topics"]
        candidate_topics = valid_topics + invalid_topics

        # Map each topic to its threshold. The classifier may return labels in a
        # different order than it was given (e.g. sorted by score), so look the
        # threshold up by label instead of zipping by position.
        thresholds = {topic: self._zero_shot_threshold_valid for topic in valid_topics}
        thresholds.update({topic: self._zero_shot_threshold_invalid for topic in invalid_topics})

        result = self._classifier(text, candidate_topics)
        found_topics = []
        for topic, score in zip(result["labels"], result["scores"]):
            if score > thresholds[topic]:
                found_topics.append(topic)
```
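For anyone who wants to poke at the behavior without loading a model, here's a standalone sketch of the same idea. Everything in it is hypothetical: `find_topics`, `fake_classifier`, and the canned scores are stand-ins I made up, and the fake classifier returns labels sorted by score on the assumption that the real one reorders them too.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str, List[str]], Dict[str, list]]


def find_topics(
    text: str,
    valid_topics: List[str],
    invalid_topics: List[str],
    classifier: Classifier,
    valid_threshold: float = 0.5,
    invalid_threshold: float = 0.5,
) -> List[str]:
    """Return the topics whose score beats the threshold for their side."""
    thresholds = {topic: valid_threshold for topic in valid_topics}
    thresholds.update({topic: invalid_threshold for topic in invalid_topics})
    result = classifier(text, valid_topics + invalid_topics)
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score > thresholds[label]
    ]


def fake_classifier(text: str, candidate_topics: List[str]) -> Dict[str, list]:
    """Stand-in for a zero-shot classifier: canned scores, sorted descending."""
    canned = {"sports": 0.45, "travel": 0.62, "politics": 0.10}
    ranked = sorted(candidate_topics, key=lambda t: canned.get(t, 0.0), reverse=True)
    return {"labels": ranked, "scores": [canned.get(t, 0.0) for t in ranked]}


# Lenient on the valid side, strict on the invalid side.
found = find_topics(
    "The team flew out for the away game.",
    valid_topics=["sports"],
    invalid_topics=["travel", "politics"],
    classifier=fake_classifier,
    valid_threshold=0.4,
    invalid_threshold=0.8,
)
print(found)
```

With a single 0.5 threshold the same canned scores would flag "travel" and miss "sports"; splitting the thresholds flips both outcomes.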

[1] Source: lost the original link so the new source is 'trust me, friendo'.
