Funded by the EU.
This component is responsible for fetching data from datasource scripts and insitu files and uploading them to S3. The container runs `pre_pro.app:__main__`, which calls `pre_pro.execution:PreProcessorExecution.__call__` with incoming messages. The logic is as follows:
- Read the `eo4eu/datasources-config` configmap entry, which is a base64-encoded JSON list of (again) base64-encoded Python scripts (see the decoding sketch after this list).
- Read the `eo4eu/metadata` configmap entry, which is a base64-encoded JSON list of dataset metainfo objects following the general EO4EU metainfo spec. Sometimes this metainfo does not exist, in which case the Pre-Processor creates basic metainfo based on the downloaded files and assigns the default dataset names `dataset-000`, `dataset-001`, etc. (see the fallback sketch after this list).
- Read the `eo4eu/inSituData` configmap entry, which is a path to an S3 object in the `eo4eu-insitu` bucket.
- Read the `eo4eu/inSituMeta` configmap entry, which is a base64-encoded JSON list containing a single dataset metainfo object. This usually does not exist, and the Pre-Processor creates basic metainfo with the dataset name `INSITU`.
- Create a list of `pre_pro.requests.Request` objects, each of which represents a datasource/insitu dataset. The code for fetching the data lives in the `.driver` field of the request, which holds a `pre_pro.drivers.DSDriver`. The driver itself uses one of three fetchers (see the structural sketch after this list):
    - `pre_pro.drivers.ScriptFetcher`: runs a datasource script under a less privileged user and detects the new files in the working directory.
    - `pre_pro.drivers.InsituFetcher`: downloads the insitu S3 object, which is typically an archive file (`.zip`), and unpacks its contents.
    - `pre_pro.drivers.InsituV2Fetcher`: downloads the insitu S3 objects, which are specified through the insitu V2 metainfo.
- Each request is run through `pre_pro.execution.PreProcessorExecution._execute_request`, which first calls `pre_pro.drivers.DSDriver.ls` on the request's driver. This is where the file downloads happen. The metainfo is then compared to the files that were actually downloaded, and an algorithm tries to match each metainfo entry to a downloaded file (the structural sketch below includes a hypothetical matching step).
- All files are uploaded to `s3://<s3-bucket-name>/source/` (see the upload sketch below).
- The metainfo objects for each file are joined into full dataset metainfo objects.
- The dataset metainfo objects are then combined and put into the Kafka message going to the next component, as well as uploaded to the S3 bucket (see the publishing sketch below).
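
The configmap entries described above are doubly encoded. The sketch below only illustrates the encoding scheme as described; the function names are hypothetical and not part of `pre_pro`:

```python
import base64
import json


def decode_datasource_scripts(configmap_value: str) -> list[str]:
    # eo4eu/datasources-config: a base64-encoded JSON list whose elements
    # are themselves base64-encoded Python scripts.
    encoded_scripts = json.loads(base64.b64decode(configmap_value))
    return [base64.b64decode(script).decode("utf-8") for script in encoded_scripts]


def decode_metainfo(configmap_value: str) -> list[dict]:
    # eo4eu/metadata and eo4eu/inSituMeta: base64-encoded JSON lists of
    # dataset metainfo objects (EO4EU metainfo spec).
    return json.loads(base64.b64decode(configmap_value))
```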
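
When `eo4eu/metadata` is absent, the Pre-Processor falls back to basic metainfo derived from the downloaded files. The real entries follow the EO4EU metainfo spec; the sketch below only shows the default naming scheme, and the field names are assumptions:

```python
from pathlib import Path


def make_default_metainfo(downloaded_files: list[Path]) -> list[dict]:
    # One minimal entry per downloaded file, named dataset-000, dataset-001, ...
    # (the "name"/"file" fields here are illustrative, not the actual spec).
    return [
        {"name": f"dataset-{index:03d}", "file": file.name}
        for index, file in enumerate(downloaded_files)
    ]
```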
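
The relationship between requests, drivers and fetchers can be pictured roughly as follows. This is a structural sketch rather than the actual `pre_pro` code; in particular, the fetcher interface and the filename-based matching step are assumptions, since the matching algorithm is not specified here:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class Fetcher(Protocol):
    # Stand-in for ScriptFetcher / InsituFetcher / InsituV2Fetcher: each
    # knows how to materialise its dataset's files locally.
    def fetch(self) -> list[Path]: ...


@dataclass
class DSDriver:
    fetcher: Fetcher

    def ls(self) -> list[Path]:
        # Downloads happen here, analogous to pre_pro.drivers.DSDriver.ls.
        return self.fetcher.fetch()


@dataclass
class Request:
    name: str
    driver: DSDriver


def match_metainfo(entries: list[dict], files: list[Path]) -> dict[str, Path]:
    # Hypothetical matching step: pair each metainfo entry with the
    # downloaded file whose name it mentions, if any.
    matches: dict[str, Path] = {}
    for entry in entries:
        for file in files:
            if entry.get("file") == file.name:
                matches[entry["name"]] = file
                break
    return matches
```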
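
Uploading the fetched files under the `source/` prefix could be done with a plain boto3 client as sketched below; the component may use its own S3 wrapper, so treat this as an illustration only:

```python
from pathlib import Path

import boto3


def upload_sources(bucket: str, files: list[Path]) -> None:
    # Every fetched file ends up at s3://<bucket>/source/<filename>.
    s3 = boto3.client("s3")
    for file in files:
        s3.upload_file(str(file), bucket, f"source/{file.name}")
```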
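
Finally, the combined dataset metainfo is sent to the next component over Kafka and also written back to the S3 bucket. A minimal sketch using `kafka-python`; the topic name, broker address, S3 key and serialization are assumptions:

```python
import json

import boto3
from kafka import KafkaProducer


def publish_metainfo(metainfo: list[dict], bucket: str, topic: str) -> None:
    payload = json.dumps(metainfo).encode("utf-8")

    # Send the joined dataset metainfo to the next component over Kafka.
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    producer.send(topic, payload)
    producer.flush()

    # Also store it in the S3 bucket alongside the uploaded files
    # (the object key here is purely illustrative).
    boto3.client("s3").put_object(Bucket=bucket, Key="metainfo.json", Body=payload)
```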
