
The EO4EU logo. Funded by the EU.

Post-Processor

This component is responsible for transferring the finished data files from the workflow to the downstream components in an appropriate form. For CF, the files are uploaded to an OpenSearch database hosted on the cluster; for AR_XR, a list of S3 paths to the relevant image files is produced.

Design

The main function is post_pro/app.py:run, which spawns a KafkaConsumer and hands it a handler implemented in post_pro/execution.py:PostProcessorExecution (see the wiring sketch after the list below). Upon receiving a message, the handler does the following:

  • Download all valid data files from the S3 bucket it is connected to; this includes .csv, .json, .xml, .xlsx, etc. (the full list may be seen in post_pro/pipeline.py:FileKind), as well as .tar and .zip archives. The latter are extracted, and their contents are treated as if they were in the S3 bucket; archives may have a nested internal structure (see the extraction sketch after this list).

  • Upload the downloaded and extracted files to OpenSearch. For this to happen, the files must be in some sort of tabular form. Each file is assigned a (hopefully unique) index; index names must be strictly alphanumeric and lowercase, so they are generated by taking the file path, removing all non-alphanumeric characters, and converting to lowercase. For files originating in archives, the name of the archive file is attached to the front (see the naming sketch after this list). For example:

    • File <s3_bucket_name>/store/b10/b10data-0.csv will have the index storeb10b10data0csv
    • File <s3_bucket_name>/Training.xml will have the index trainingxml
    • File <s3_bucket_name>/archive.tar.xz:Region2/pr0.dbf will have the index archivetarxzregion2pr0dbf
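
A minimal sketch of the consumer wiring described above, assuming the kafka-python client; the topic and broker names are placeholders, and the handle method is a hypothetical stand-in for the real PostProcessorExecution interface:

```python
from kafka import KafkaConsumer  # kafka-python client

from post_pro.execution import PostProcessorExecution


def run() -> None:
    # Placeholder topic/broker; real values come from the component's config.
    consumer = KafkaConsumer("workflow-results", bootstrap_servers="kafka:9092")
    handler = PostProcessorExecution()
    for message in consumer:
        # Hypothetical entry point; the real handler interface may differ.
        handler.handle(message.value)
```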
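
A rough sketch of the archive flattening, using the standard-library tarfile and zipfile modules; the expand helper is hypothetical, and the real logic (including the FileKind filtering) lives in post_pro/pipeline.py:

```python
import io
import tarfile
import zipfile
from typing import Iterator


def expand(key: str, blob: bytes) -> Iterator[tuple[str, bytes]]:
    """Yield (path, data) pairs, flattening .tar/.zip archives."""
    if key.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    # Members keep the archive name as a prefix, e.g. "a.zip:dir/f.csv".
                    yield f"{key}:{info.filename}", zf.read(info)
    elif ".tar" in key:
        # Mode "r:*" auto-detects compression (.tar, .tar.gz, .tar.xz, ...).
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    yield f"{key}:{member.name}", tf.extractfile(member).read()
    else:
        yield key, blob
```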
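
The naming rule is a one-liner; a minimal sketch, where the index_name helper is hypothetical but the assertions mirror the three examples above:

```python
import re


def index_name(path: str, archive: str | None = None) -> str:
    """Derive an OpenSearch index name from a bucket-relative file path."""
    full = f"{archive}:{path}" if archive else path
    # Index names must be strictly alphanumeric and lowercase.
    return re.sub(r"[^A-Za-z0-9]", "", full).lower()


assert index_name("store/b10/b10data-0.csv") == "storeb10b10data0csv"
assert index_name("Training.xml") == "trainingxml"
assert index_name("Region2/pr0.dbf", archive="archive.tar.xz") == "archivetarxzregion2pr0dbf"
```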
