Getting Error while running csv example for my file  #136

@purnima1612

Hello all,
I am trying to run the csv example on my own file, which has 850 records. I am also trying to find duplicates using a custom function based on Levenshtein distance, so that all names whose similarity is above 80% are grouped under one entity_num.
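
To make the 80% criterion concrete, here is a minimal sketch of a name comparison of that kind. It assumes the similarity is the Levenshtein distance normalized by the length of the longer string; the names `levenshtein`, `name_similarity` and `THRESHOLD` are illustrative only and are not taken from the actual script or from dedupe.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string as the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


THRESHOLD = 0.8  # hypothetical cut-off matching the "more than 80%" rule

same_entity = name_similarity("Jonathan", "Jonathon") > THRESHOLD
different_entity = name_similarity("kitten", "sitting") > THRESHOLD
```

With this normalization, "Jonathan" and "Jonathon" differ by one substitution over eight characters, so they fall above the threshold, while "kitten" and "sitting" do not.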

While preparing the data I changed the sample size to 50:
deduper.prepare_training(data_d, sample_size=50)

After I finish labeling, I get the following error:


Traceback (most recent call last):
  File "C:\Python_Projects\Python_extra_code\csv_example.py", line 132, in <module>
    deduper.train()
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\api.py", line 1215, in train
    self.predicates = self.active_learner.learn_predicates(recall, index_predicates)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 397, in learn_predicates
    return self.blocker.learn_predicates(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\labeler.py", line 136, in learn_predicates
    return self.block_learner.learn(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 72, in learn
    candidate_cover = self.random_forest_candidates(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\site-packages\dedupe\training.py", line 112, in random_forest_candidates
    sample_predicates = random.sample(predicates, pred_sample_size)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Dev\Python3.11\Lib\random.py", line 453, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative

Process finished with exit code 1
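
The last frame of the traceback is the standard library's `random.sample`, which raises this `ValueError` whenever the requested sample size exceeds the number of items available. The sketch below reproduces that behaviour in isolation; the `population` list and the `requested` value are hypothetical stand-ins, not dedupe's actual predicate set or sample size, and the clamping at the end only illustrates the condition, it is not a confirmed fix for dedupe.

```python
import random

# Hypothetical stand-in for a small set of candidates.
population = ["p1", "p2", "p3"]
requested = 5  # more items than the population holds

try:
    random.sample(population, requested)
    error_message = None
except ValueError as exc:
    # Same exception type as in the traceback above.
    error_message = str(exc)

# Asking for no more than is available avoids the exception.
clamped = random.sample(population, min(requested, len(population)))
```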
