Open
Labels
enhancement (New feature or request); topic: data (Issue about data loader modules)
Description
I cannot load a UTF-8 file while building my vocabulary or loading my dataset, because gbk is used as the default encoding on Windows. I added a new option that allows the encoding to be set manually in PairedTextData. #269
$ python main.py
Traceback (most recent call last):
  File "main.py", line 62, in <module>
    main()
  File "main.py", line 28, in main
    hparams=config_data.train, device=device)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\data\paired_text_data.py", line 140, in __init__
    eos_token=src_hparams.eos_token)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 103, in __init__
    = self.load(self._filename)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in load
    vocab = list(line.strip() for line in vocab_file)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in <genexpr>
    vocab = list(line.strip() for line in vocab_file)
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence
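For context, the traceback above is consistent with a file being opened in text mode without an explicit encoding. In that case Python falls back to the locale's preferred encoding, which is typically gbk (cp936) on a Chinese-locale Windows machine unless UTF-8 mode is enabled. The sketch below shows the general idea of passing the encoding explicitly. It is not the Texar code or the change in #269: `load_vocab` and its `encoding` parameter are hypothetical names chosen for illustration.

```python
import locale
import os
import tempfile


def load_vocab(filename, encoding="utf-8"):
    """Hypothetical helper: read one token per line with an explicit encoding.

    Passing `encoding` to open() avoids depending on the platform default.
    """
    with open(filename, "r", encoding=encoding) as vocab_file:
        return [line.strip() for line in vocab_file]


# The encoding open() would use if none were given (platform dependent).
default_encoding = locale.getpreferredencoding(False)

# Write a small UTF-8 vocabulary file, then read it back with the helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("你好\n世界\n")
    vocab = load_vocab(path, encoding="utf-8")
```

Defaulting the hypothetical `encoding` argument to "utf-8" is one possible design choice; a library could instead default to `None` to keep the existing platform-dependent behavior and let users opt in.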