Funded by the EU.
This component is responsible for fetching data from datasource scripts and insitu files and uploading them to S3. The container runs `pre_pro.app:__main__`, which calls `pre_pro.execution:PreProcessorExecution.__call__` with incoming messages. The logic is as follows:
- Read the `eo4eu/datasources-config` configmap entry, which is a base64-encoded JSON list of (again) base64-encoded Python scripts (see the decoding sketch after this list).
- Read the `eo4eu/metadata` configmap entry, which is a base64-encoded JSON list of dataset metainfo objects following the general EO4EU metainfo spec. Sometimes this metainfo does not exist, in which case the Pre-Processor creates basic metainfo based on the downloaded files and assigns the default dataset names `dataset-000`, `dataset-001`, etc. (see the fallback sketch after this list).
- Read the `eo4eu/inSituData` configmap entry, which is a path to an S3 object in the `eo4eu-insitu` bucket.
- Read the `eo4eu/inSituMeta` configmap entry, which is a base64-encoded JSON list containing a single dataset metainfo object. This usually does not exist, and the Pre-Processor creates basic metainfo with the dataset name `INSITU`.
- Create a list of `pre_pro.requests.Request` objects, each of which represents a datasource/insitu dataset. The code for fetching the data lives in the `.driver` field of the request, which holds a `pre_pro.drivers.DSDriver`. The driver itself uses one of three fetchers (see the structural sketch after this list):
    - `pre_pro.drivers.ScriptFetcher`: runs a datasource script under a less privileged user and detects the new files in the working directory.
    - `pre_pro.drivers.InsituFetcher`: downloads the insitu S3 object, which is typically an archive file (`.zip`), and unpacks its contents.
    - `pre_pro.drivers.InsituV2Fetcher`: downloads the insitu S3 objects, which are specified through the insitu V2 metainfo.
- Each request is run through `pre_pro.execution.PreProcessorExecution._execute_request`, which first calls `pre_pro.drivers.DSDriver.ls` on the request's driver. This is where the file downloads happen. The metainfo is then compared to the files that were actually downloaded, and an algorithm tries to match each metainfo entry to a downloaded file (the structural sketch below includes a hypothetical matching step).
- All files are uploaded to `s3://<s3-bucket-name>/source/` (see the upload sketch below).
- The metainfo objects for each file are joined into full dataset metainfo objects.
- The dataset metainfo objects are then combined and put into the Kafka message going to the next component, as well as uploaded to the S3 bucket (see the publishing sketch below).
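
The configmap entries described above are doubly encoded. The sketch below only illustrates the encoding scheme as described; the function names are hypothetical and not part of `pre_pro`:

```python
import base64
import json


def decode_datasource_scripts(configmap_value: str) -> list[str]:
    # eo4eu/datasources-config: a base64-encoded JSON list whose elements
    # are themselves base64-encoded Python scripts.
    encoded_scripts = json.loads(base64.b64decode(configmap_value))
    return [base64.b64decode(script).decode("utf-8") for script in encoded_scripts]


def decode_metainfo(configmap_value: str) -> list[dict]:
    # eo4eu/metadata and eo4eu/inSituMeta: base64-encoded JSON lists of
    # dataset metainfo objects (EO4EU metainfo spec).
    return json.loads(base64.b64decode(configmap_value))
```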
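
When `eo4eu/metadata` is absent, the Pre-Processor falls back to basic metainfo derived from the downloaded files. The real entries follow the EO4EU metainfo spec; the sketch below only shows the default naming scheme, and the field names are assumptions:

```python
from pathlib import Path


def make_default_metainfo(downloaded_files: list[Path]) -> list[dict]:
    # One minimal entry per downloaded file, named dataset-000, dataset-001, ...
    # (the "name"/"file" fields here are illustrative, not the actual spec).
    return [
        {"name": f"dataset-{index:03d}", "file": file.name}
        for index, file in enumerate(downloaded_files)
    ]
```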
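
The relationship between requests, drivers and fetchers can be pictured roughly as follows. This is a structural sketch rather than the actual `pre_pro` code; in particular, the fetcher interface and the filename-based matching step are assumptions, since the matching algorithm is not specified here:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class Fetcher(Protocol):
    # Stand-in for ScriptFetcher / InsituFetcher / InsituV2Fetcher: each
    # knows how to materialise its dataset's files locally.
    def fetch(self) -> list[Path]: ...


@dataclass
class DSDriver:
    fetcher: Fetcher

    def ls(self) -> list[Path]:
        # Downloads happen here, analogous to pre_pro.drivers.DSDriver.ls.
        return self.fetcher.fetch()


@dataclass
class Request:
    name: str
    driver: DSDriver


def match_metainfo(entries: list[dict], files: list[Path]) -> dict[str, Path]:
    # Hypothetical matching step: pair each metainfo entry with the
    # downloaded file whose name it mentions, if any.
    matches: dict[str, Path] = {}
    for entry in entries:
        for file in files:
            if entry.get("file") == file.name:
                matches[entry["name"]] = file
                break
    return matches
```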
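
Uploading the fetched files under the `source/` prefix could be done with a plain boto3 client as sketched below; the component may use its own S3 wrapper, so treat this as an illustration only:

```python
from pathlib import Path

import boto3


def upload_sources(bucket: str, files: list[Path]) -> None:
    # Every fetched file ends up at s3://<bucket>/source/<filename>.
    s3 = boto3.client("s3")
    for file in files:
        s3.upload_file(str(file), bucket, f"source/{file.name}")
```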
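
Finally, the combined dataset metainfo is sent to the next component over Kafka and also written back to the S3 bucket. A minimal sketch using `kafka-python`; the topic name, broker address, S3 key and serialization are assumptions:

```python
import json

import boto3
from kafka import KafkaProducer


def publish_metainfo(metainfo: list[dict], bucket: str, topic: str) -> None:
    payload = json.dumps(metainfo).encode("utf-8")

    # Send the joined dataset metainfo to the next component over Kafka.
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    producer.send(topic, payload)
    producer.flush()

    # Also store it in the S3 bucket alongside the uploaded files
    # (the object key here is purely illustrative).
    boto3.client("s3").put_object(Bucket=bucket, Key="metainfo.json", Body=payload)
```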
