Encoding error on Windows #270

@imgaojun

I cannot load a UTF-8 file when building my vocabulary or loading my dataset, because gbk is used as the default encoding on Windows. I added a new option that allows the encoding to be set manually in PairedTextData (#269).

$ python main.py 
Traceback (most recent call last):
  File "main.py", line 62, in <module>
    main()
  File "main.py", line 28, in main
    hparams=config_data.train, device=device)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\data\paired_text_data.py", line 140, in __init__
    eos_token=src_hparams.eos_token)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 103, in __init__
    = self.load(self._filename)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in load
    vocab = list(line.strip() for line in vocab_file)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in <genexpr>
    vocab = list(line.strip() for line in vocab_file)
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence
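
For context, Python's built-in `open()` falls back to the locale's preferred encoding when no `encoding` argument is given, which would explain a gbk decode attempt on a Chinese-locale Windows machine. The sketch below shows the general workaround of passing the encoding explicitly. It is only an illustration: `load_vocab` and its `encoding` parameter are hypothetical names, not Texar's actual API, and the traceback does not show how Texar opens the file.

```python
import os
import tempfile


def load_vocab(filename, encoding="utf-8"):
    """Hypothetical helper: read one token per line with an explicit encoding.

    Passing `encoding` to open() avoids relying on the platform default.
    """
    with open(filename, "r", encoding=encoding) as vocab_file:
        return [line.strip() for line in vocab_file]


# Write a small UTF-8 vocabulary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("你好\n世界\n")
    vocab = load_vocab(path)
```

Defaulting the parameter to UTF-8 while still letting callers override it keeps files in other encodings loadable.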

Labels: enhancement (New feature or request), topic: data (Issue about data loader modules)
