
File processing app that performs tasks equivalent to an ESB (Enterprise Service Bus): it moves files between folders and processes metadata information into catalogues and log tables.


InovaFiscaliza/Scarab


Table of Contents
  1. About Scarab
  2. Scripts and Files
  3. How it works
  4. Companion Services
  5. Tests
  6. Setup
  7. Roadmap
  8. Contributing
  9. License

About Scarab

This app is intended to run as a service and performs ESB (Enterprise Service Bus) tasks, moving files between input and output folders while carrying out basic file processing tasks, including file type checking, backup, metadata aggregation and logging.

Metadata files are expected to be tables in XLSX or CSV format, with the first row as the column headers, or JSON arrays and dictionaries.

XLSX and JSON files may contain multiple tables, defined as sheets or as first-level dictionary entries, respectively. Associations between tables are defined by "Primary Key" and "Foreign Key" columns, and may use absolute values (valid across files) or relative values (specific to each file).

Metadata may also be extracted from the filenames using regex patterns and the filename itself may be stored in the metadata file.
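As an illustration of that filename parsing, named capture groups in a regex can pull metadata fields out of a filename. The pattern and field names below are hypothetical; in Scarab the actual patterns come from the configuration file:

```python
import re

# Hypothetical pattern: a station code and a date encoded in the filename.
PATTERN = re.compile(r"(?P<station>[A-Z]{3})_(?P<date>\d{8})\.(?P<ext>\w+)$")

def parse_filename(name: str) -> dict:
    """Extract metadata fields from a filename using named regex groups."""
    match = PATTERN.search(name)
    if match is None:
        return {}
    meta = match.groupdict()
    meta["filename"] = name  # the filename itself may also be stored
    return meta

parse_filename("SPO_20250221.csv")
# {'station': 'SPO', 'date': '20250221', 'ext': 'csv', 'filename': 'SPO_20250221.csv'}
```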

The application is written in Python, uses UV for package and environment management, and is intended to run as a service. Examples of service configuration files for the Windows Task Scheduler are provided in the data/examples folder.

Scripts and Files

| Script module | Description |
| --- | --- |
| scarab.py | Main script to run the service |
| config_handler.py | Handles configuration file parsing, validation and processing into the configuration object used by the application |
| default_config.json | Default configuration file, used by the script to fill optional values in user configuration files; may be edited to change the default values |
| log_handler.py | Configures the standard Python logging module from the configured parameters, such as enabling the selected output channels and message formatting |
| file_handler.py | Handles file operations such as copy, move and delete |
| metadata_handler.py | Handles metadata operations, including reading, merging and storing |

Apart from scripts, the application uses a configuration file to define the application parameters, such as folder paths, folder revisit timing, log method, log level, and log file name, among others.
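Purely as an illustration of those parameters, a configuration file might look like the following. All key names here are hypothetical; the real schema is described in the Scarab documentation:

```json
{
  "folders": {
    "input": ["C:/scarab/in"],
    "output": ["C:/scarab/out"],
    "trash": "C:/scarab/trash"
  },
  "revisit_seconds": 60,
  "log": {
    "channels": ["screen", "file"],
    "level": "INFO",
    "file": "scarab.log"
  }
}
```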

Please check Scarab documentation for more details on the configuration file.

Examples of configuration files can be found in the defined tests.

How it works

The script is designed to run as a Windows service, reading task definitions from a configuration file in JSON format.

The script can monitor multiple folders. As soon as any content changes, the script sorts the content into three groups: data files to be moved, metadata files to be processed, and other files or folders that may be deleted or ignored.

Files are identified by name, using regex patterns defined in the configuration file. Metadata files are further evaluated by their content, which is expected to contain a minimum set of data columns; otherwise the file is considered invalid and treated as an unknown file.
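A minimal sketch of that triage step, with invented patterns (the real ones, and the grouping logic, come from the configuration file):

```python
import re

# Hypothetical patterns; in Scarab these are defined in the configuration file.
DATA_RE = re.compile(r"\.(bin|dat|wav)$", re.IGNORECASE)
METADATA_RE = re.compile(r"\.(xlsx|csv|json)$", re.IGNORECASE)

def triage(filenames):
    """Sort folder content into data files, metadata files and everything else."""
    groups = {"data": [], "metadata": [], "other": []}
    for name in filenames:
        if METADATA_RE.search(name):
            groups["metadata"].append(name)
        elif DATA_RE.search(name):
            groups["data"].append(name)
        else:
            groups["other"].append(name)
    return groups

triage(["meas.bin", "catalog.xlsx", "notes.txt"])
# {'data': ['meas.bin'], 'metadata': ['catalog.xlsx'], 'other': ['notes.txt']}
```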

Metadata files are expected to be tables (XLSX or CSV format) with the first row as the header, or arrays and dictionaries in JSON format.

One or more columns may be used as key columns to uniquely identify each row.

The script concatenates tables and updates rows based on the key columns. During updates, columns with null values are ignored; to remove data entirely, the service must be stopped and the consolidated metadata file edited manually.
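That update rule can be sketched with pandas. This is not Scarab's actual implementation, just an illustration of the behaviour described above, with invented column names:

```python
import pandas as pd

def upsert(consolidated: pd.DataFrame, incoming: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Concatenate, then keep one row per key. groupby(...).last() takes the
    last non-null value per column, so nulls in the new file never erase data."""
    merged = pd.concat([consolidated, incoming], ignore_index=True)
    return merged.groupby(keys, as_index=False, sort=False).last()

old = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "size": [10, 20]})
new = pd.DataFrame({"id": [2], "name": [None], "size": [25]})
result = upsert(old, new, ["id"])
# row id=2 keeps name 'b' (the null is ignored) and takes the new size 25
```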

Column order in the consolidated metadata file follows the last metadata file processed. Columns that exist in previous metadata files but not in the new one are kept with minimal order changes, as close as possible to their original neighbours; otherwise they are appended to the end of the table.

Additional columns may be added to the consolidated metadata file, including the filename itself and information parsed from the filename using regex capture groups.

Rows may be ordered by any indicated column; by default they are sorted in the order in which they were created while processing the input files.

For XLSX files with multiple sheets, JSON files with multiple dictionaries at the root, or CSV files with significantly different column sets or matching different regex rules, the script can create a multi-table consolidated result, including relational associations between tables through Primary Key (PK) and Foreign Key (FK) relationships. Such relationships may be relative, within a single file, or absolute, across multiple files.
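For example, a JSON metadata file with two root dictionaries could define two related tables, where the "FK" column in one table points to the "PK" of the other. Table and column names here are illustrative; the key column names are set in the configuration:

```json
{
  "campaigns": [
    {"PK": 1, "name": "coastal survey"}
  ],
  "measurements": [
    {"PK": 1, "FK": 1, "file": "SPO_20250221.bin"},
    {"PK": 2, "FK": 1, "file": "SPO_20250222.bin"}
  ]
}
```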

The consolidated metadata file is stored in multiple output folders, and different output formats may be used: primarily XLSX, but also CSV, JSON, QVD (Qlik Sense) and Parquet. For single-table formats (CSV, QVD and Parquet), when consolidating multi-table data, a "_table" suffix is appended to the base name of the catalog file to indicate which table the data belongs to.

Data files may be of any type and may be moved to multiple output folders. Different output folders may be set for different file regex patterns.

References to the data files within the consolidated metadata file may be used to add further metadata, indicating that the data file was moved to the output folder.

Data files without corresponding metadata may be held in temp folders and eventually moved to trash.

Input folder cleaning policies may be defined in the configuration file, including moving files to a trash folder or deleting them, thereby handling unidentified files.

Exceptions to the cleaning policies may be defined to allow specific input folder structures, or to preserve files that cannot be processed.

A log keeps track of script execution; it can be printed to the terminal and/or saved to a file.

To stop, the script monitors for a kill signal from the system, or Ctrl+C if running in the terminal.

Companion Services

Companion services were developed to extend the application's functionality and create a production environment integrated with MS SharePoint and cloud services, including:

  • PowerAutomate script that extracts metadata from files uploaded to MS SharePoint repositories through a browser or the OneDrive client application. Metadata and the corresponding uploaded files are placed into restricted repositories monitored by the Scarab service. An example of such a script is provided in the src/PA folder.
  • Windows Task Scheduler configuration to run the Scarab service as a Windows task. Examples are provided in the src/Scheduler folder. This allows the service to run without user intervention, starting with the system and restarting in case of failure, on a machine that can also run other companion services such as the OneDrive client application. That enables Scarab to access SharePoint repositories as locally synced folders, without additional coding for SharePoint API access, which may be restricted in some environments.

Tests

Tests are provided for different scenarios to validate the scripts and modules.

Please check the tests folder for more details.

Setup

The scripts are intended to be used on a Windows machine with UV for package and environment management.

You may simply clone the repository and run the script, or follow the install procedures.

For more details about UV, please check the UV documentation
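Assuming UV is already installed and scarab.py sits at the repository root (adjust the path to your checkout; any command-line arguments are described in the Scarab documentation), a typical setup would be:

```shell
# Clone the repository; UV resolves the environment on first run.
git clone https://github.com/InovaFiscaliza/Scarab.git
cd Scarab

# Run the service entry point (script path may differ in your checkout).
uv run scarab.py
```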

Please check Scarab documentation for more details on the configuration file.

Additional examples can be found in the data folder of the repository.

These examples include:

  • .json configuration files for the application, covering scenarios currently in use.
  • .xml files for the Windows Task Scheduler to run the application as a service.
  • A zip file with a companion script for PowerAutomate that extracts data from MS SharePoint repositories and posts it to the input folders.

Roadmap

This section presents a simplified view of the roadmap.

  • Version 1.0.0: 21/02/2025, initial release
    • Version 1.0.1: 31/03/2025, bug fixes
    • Version 1.1.0: 14/04/2025, row ordering, ignore feature and Scarab companion
  • Version 2.0.0: 08/07/2025, multi-table support with PK/FK updates, advanced regex, filename processing
    • Version 2.1.0: 08/09/2025, improved validation and error handling, multiple output folders, automatic file encoding detection
    • Version 2.1.1: 30/01/2026, column order fix, updated docstrings and type hints, additional examples and tests, PA companion updated to vectorized processing
  • Planned: direct SharePoint access through the API, plus open issues

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

License

Distributed under the GNU General Public License (GPL), version 3. See LICENSE.txt.

For additional information, please check https://www.gnu.org/licenses/quick-guide-gplv3.html

This license model was selected to enable collaboration from anyone interested in the projects listed within this group.

It is in line with the Brazilian Public Software directives, as published at: https://softwarepublico.gov.br/social/articles/0004/5936/Manual_do_Ofertante_Temporario_04.10.2016.pdf

