Can't download datasets if .aws config is present #238

@pvk-developer

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDGym version: 0.6.1
  • Python version: Any
  • Operating System: MacOS / Unix / Ubuntu

Error Description

When SDGym runs in a local environment that has a ~/.aws/ folder containing AWS configuration, downloading a dataset fails with the following error:

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

Steps to reproduce

To reproduce, create a .aws folder in your home directory (mkdir ~/.aws), then create a file called credentials inside it and add:

[default]
aws_access_key_id = <your id>
aws_secret_access_key = <your access key>
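If you would rather not touch a real ~/.aws/credentials file while reproducing, the same setup can be sketched in a temporary directory. This is only a sketch: `write_dummy_credentials` is a hypothetical helper, not part of SDGym, and it assumes that botocore honours the `AWS_SHARED_CREDENTIALS_FILE` environment variable the same way it reads the default location. The placeholder values are the ones from the snippet above.

```python
import configparser
import os
import tempfile
from pathlib import Path


def write_dummy_credentials(directory, key_id='<your id>', secret='<your access key>'):
    """Write a credentials file with a [default] profile and return its path.

    Hypothetical helper for reproducing the issue; not part of SDGym.
    """
    # interpolation=None keeps configparser from treating '%' specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser['default'] = {
        'aws_access_key_id': key_id,
        'aws_secret_access_key': secret,
    }
    path = Path(directory) / 'credentials'
    with path.open('w') as handle:
        parser.write(handle)
    return path


# Create the file in a throwaway directory instead of ~/.aws.
tmp_dir = tempfile.mkdtemp()
credentials_path = write_dummy_credentials(tmp_dir)

# Point the AWS SDK at the throwaway file. This has to happen before
# boto3 creates its client, i.e. before calling sdgym.
os.environ['AWS_SHARED_CREDENTIALS_FILE'] = str(credentials_path)
```

Unsetting the environment variable afterwards restores the normal credential lookup.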

PS: For the error to show up, first clear the cache of downloaded datasets. Otherwise the local copy is used and no download is attempted.

import sdgym

In [4]: sdgym.benchmark_single_table(synthesizers=['GaussianCopulaSynthesizer'], sdv_datasets=['student_placements'], timeout=22)
---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
Cell In[4], line 1
----> 1 sdgym.benchmark_single_table(synthesizers=['GaussianCopulaSynthesizer'], sdv_datasets=['student_placements'], timeout=22)

File ~/Projects/sdv-dev/SDGym/sdgym/benchmark.py:507, in benchmark_single_table(synthesizers, custom_synthesizers, sdv_datasets, additional_datasets_folder, limit_dataset_size, compute_quality_score, sdmetrics, timeout, output_filepath, detailed_results_folder, show_progress, multi_processing_config)
    503 _validate_inputs(output_filepath, detailed_results_folder, synthesizers, custom_synthesizers)
    505 _create_detailed_results_directory(detailed_results_folder)
--> 507 job_args_list = _generate_job_args_list(
    508     limit_dataset_size, sdv_datasets, additional_datasets_folder, sdmetrics,
    509     detailed_results_folder, timeout, compute_quality_score, synthesizers, custom_synthesizers)
    511 scores = _run_jobs(multi_processing_config, job_args_list, show_progress)
    512 if output_filepath:

File ~/Projects/sdv-dev/SDGym/sdgym/benchmark.py:90, in _generate_job_args_list(limit_dataset_size, sdv_datasets, additional_datasets_folder, sdmetrics, detailed_results_folder, timeout, compute_quality_score, synthesizers, custom_synthesizers)
     88 datasets = []
     89 if sdv_datasets is not None:
---> 90     datasets = get_dataset_paths(sdv_datasets, None, None, None, None)
     92 if additional_datasets_folder:
     93     additional_datasets = get_dataset_paths(None, None, additional_datasets_folder, None, None)

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:200, in get_dataset_paths(datasets, datasets_path, bucket, aws_key, aws_secret)
    196     else:
    197         datasets = _get_available_datasets(
    198             'single_table', bucket=bucket)['dataset_name'].tolist()
--> 200 return [
    201     _get_dataset_path('single_table', dataset, datasets_path, bucket, aws_key, aws_secret)
    202     for dataset in datasets
    203 ]

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:201, in <listcomp>(.0)
    196     else:
    197         datasets = _get_available_datasets(
    198             'single_table', bucket=bucket)['dataset_name'].tolist()
    200 return [
--> 201     _get_dataset_path('single_table', dataset, datasets_path, bucket, aws_key, aws_secret)
    202     for dataset in datasets
    203 ]

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:60, in _get_dataset_path(modality, dataset, datasets_path, bucket, aws_key, aws_secret)
     57     if local_path.exists():
     58         return local_path
---> 60 download_dataset(
     61     modality, dataset, dataset_path, bucket=bucket, aws_key=aws_key, aws_secret=aws_secret)
     62 return dataset_path

File ~/Projects/sdv-dev/SDGym/sdgym/datasets.py:36, in download_dataset(modality, dataset_name, datasets_path, bucket, aws_key, aws_secret)
     34 LOGGER.info('Downloading dataset %s from %s', dataset_name, bucket)
     35 s3 = get_s3_client(aws_key, aws_secret)
---> 36 obj = s3.get_object(Bucket=bucket_name, Key=f'{modality.upper()}/{dataset_name}.zip')
     37 bytes_io = io.BytesIO(obj['Body'].read())
     39 LOGGER.info('Extracting dataset into %s', datasets_path)

File ~/.virtualenvs/SDGym/lib/python3.8/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    526     raise TypeError(
    527         f"{py_operation_name}() only accepts keyword arguments."
    528     )
    529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)

File ~/.virtualenvs/SDGym/lib/python3.8/site-packages/botocore/client.py:960, in BaseClient._make_api_call(self, operation_name, api_params)
    958     error_code = parsed_response.get("Error", {}).get("Code")
    959     error_class = self.exceptions.from_code(error_code)
--> 960     raise error_class(parsed_response, operation_name)
    961 else:
    962     return parsed_response

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.
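One possible direction, sketched under assumptions: the traceback shows `get_s3_client(aws_key, aws_secret)` being called with both arguments set to None, so the client presumably falls back to whatever credentials botocore finds in ~/.aws. If the datasets bucket allows anonymous reads (an assumption, not confirmed here), the client could be built with unsigned requests whenever no explicit keys are passed. `s3_client_kwargs` and `make_s3_client` below are hypothetical names, not SDGym's actual implementation.

```python
def s3_client_kwargs(aws_key=None, aws_secret=None):
    """Build keyword arguments for boto3.client('s3', **kwargs).

    Hypothetical helper, not part of SDGym.
    """
    if aws_key is not None and aws_secret is not None:
        # Explicit credentials were passed: sign requests with them.
        return {
            'aws_access_key_id': aws_key,
            'aws_secret_access_key': aws_secret,
        }

    # No explicit credentials: make unsigned (anonymous) requests so that
    # nothing found in ~/.aws/credentials is ever sent to S3.
    # botocore is imported here so the helper loads even without it installed.
    from botocore import UNSIGNED
    from botocore.config import Config
    return {'config': Config(signature_version=UNSIGNED)}


def make_s3_client(aws_key=None, aws_secret=None):
    """Create an S3 client that ignores local AWS config unless keys are given."""
    import boto3
    return boto3.client('s3', **s3_client_kwargs(aws_key, aws_secret))
```

With this shape, `make_s3_client()` would request public objects anonymously, while `make_s3_client(aws_key, aws_secret)` would keep signing requests for private buckets.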

Labels: bug (Something isn't working)
