Sample data for this problem statement can be found in the data folder.
There are two examples, each consisting of a JSON file and an image file.
For example, data/1.json and data/1.jpg are the JSON and image files for the first example.
The JSON file contains the following information:
- words: list of words extracted from the page. Each word is represented by a dictionary with the following keys:
  - text: the text of the word
  - w_min: the x-coordinate of the top-left corner of the word
  - h_min: the y-coordinate of the top-left corner of the word
  - w_max: the x-coordinate of the bottom-right corner of the word
  - h_max: the y-coordinate of the bottom-right corner of the word
- table: the location of the transaction table, represented as a dictionary with the following keys:
  - w_min: the x-coordinate of the top-left corner of the table
  - h_min: the y-coordinate of the top-left corner of the table
  - w_max: the x-coordinate of the bottom-right corner of the table
  - h_max: the y-coordinate of the bottom-right corner of the table
- header: the location of the transaction table header, represented as a dictionary with the following keys:
  - w_min: the x-coordinate of the top-left corner of the header
  - h_min: the y-coordinate of the top-left corner of the header
  - w_max: the x-coordinate of the bottom-right corner of the header
  - h_max: the y-coordinate of the bottom-right corner of the header
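
To make the structure above concrete, here is a minimal sketch of loading a page and picking out the words that fall inside the header box. The helper names `load_page` and `words_in_box` are hypothetical (they are not part of the repository), the sample values are invented, and testing the centre point of each word is just one reasonable way to decide membership.

```python
import json


def load_page(path):
    """Load one example JSON file (e.g. data/1.json) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def words_in_box(words, box):
    """Return the words whose centre point lies inside `box`."""
    selected = []
    for word in words:
        cx = (word["w_min"] + word["w_max"]) / 2
        cy = (word["h_min"] + word["h_max"]) / 2
        inside_x = box["w_min"] <= cx <= box["w_max"]
        inside_y = box["h_min"] <= cy <= box["h_max"]
        if inside_x and inside_y:
            selected.append(word)
    return selected


# Hand-made page in the format described above (all values are invented).
page = {
    "words": [
        {"text": "Date", "w_min": 0.10, "h_min": 0.30, "w_max": 0.15, "h_max": 0.32},
        {"text": "Balance", "w_min": 0.80, "h_min": 0.30, "w_max": 0.90, "h_max": 0.32},
        {"text": "Page", "w_min": 0.10, "h_min": 0.95, "w_max": 0.14, "h_max": 0.97},
    ],
    "table": {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.90},
    "header": {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.33},
}

header_words = words_in_box(page["words"], page["header"])
header_texts = [w["text"] for w in header_words]
print(header_texts)  # ['Date', 'Balance']
```

With a real example, `page = load_page("data/1.json")` would replace the hand-made dictionary.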
Notes
- The coordinates are normalized to be between 0 and 1.
- OCR has already been run and the recognized text is provided. You do not need to perform any OCR or text extraction.
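
Because the coordinates are normalized, they need to be scaled before they can be used against the image (for example, to crop or draw a box). The sketch below assumes that w values are fractions of the image width and h values are fractions of the image height, with the origin at the top-left corner. The function name `to_pixels` and the image size are hypothetical.

```python
def to_pixels(box, image_width, image_height):
    """Scale a normalized box to integer pixels as (left, top, right, bottom)."""
    return (
        round(box["w_min"] * image_width),
        round(box["h_min"] * image_height),
        round(box["w_max"] * image_width),
        round(box["h_max"] * image_height),
    )


# Invented table box and image size, for illustration only.
table = {"w_min": 0.05, "h_min": 0.25, "w_max": 0.95, "h_max": 0.75}
pixel_box = to_pixels(table, image_width=1000, image_height=2000)
print(pixel_box)  # (50, 500, 950, 1500)
```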
Given a bank statement (PDF), we are interested in extracting the following information from the document:
- Statement Period, which includes a start date and an end date
- Opening balance amount
- Transaction table headers
- Balance amounts from the transaction table
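
One possible starting point for the table-related items is to group the words into rows by vertical position and then parse the amounts in each row. This is only a sketch, not the expected solution: the helper names, the tolerance value, the amount format, and the assumption that the balance is the right-most amount in a row are all illustrative guesses, and the sample words are invented.

```python
import re

# Matches amounts such as "600.00", "1,100.00" or "-25" (assumed format).
AMOUNT_RE = re.compile(r"^-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def parse_amount(text):
    """Return a float for an amount string, or None if it is not an amount."""
    cleaned = text.strip()
    if not AMOUNT_RE.match(cleaned):
        return None
    return float(cleaned.replace(",", ""))


def group_into_rows(words, tolerance=0.005):
    """Group words with similar vertical centres; each row is sorted left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w["h_min"] + w["h_max"]) / 2):
        cy = (word["h_min"] + word["h_max"]) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"cy": cy, "words": [word]})
    return [sorted(row["words"], key=lambda w: w["w_min"]) for row in rows]


def make_word(text, w_min, h_min):
    """Build a word dict with a fixed, invented size."""
    return {"text": text, "w_min": w_min, "h_min": h_min,
            "w_max": w_min + 0.08, "h_max": h_min + 0.02}


# Two invented transaction rows.
words = [
    make_word("1,100.00", 0.80, 0.400),
    make_word("01/02", 0.10, 0.400),
    make_word("Deposit", 0.20, 0.401),
    make_word("600.00", 0.80, 0.430),
    make_word("02/02", 0.10, 0.430),
    make_word("Rent", 0.20, 0.430),
]

rows = group_into_rows(words)
# Assumption: the balance is the right-most word in each row.
balances = [parse_amount(row[-1]["text"]) for row in rows]
print(balances)  # [1100.0, 600.0]
```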
The expected outputs for the two examples are provided in the tests/test_extractors file.
A visual example of extracted data is shown below:
Evaluation will be done by running tests located in the tests folder.
The tests will check the following:
- The statement period is extracted correctly (test_statement_period)
- The opening balance amount is extracted correctly (test_opening_balance)
- The transaction table headers are extracted correctly (test_table_headers)
- The balance amounts from the transaction table are extracted correctly (test_table_balance)
Note: Do not modify the tests.
The following optional task will be evaluated separately:
- Write an API server that accepts a JSON file and an image file and returns all the extracted data.
- Clone this repository to your local machine.
- You should modify the src/extractors.py file. The functions have already been defined for you.
- Run tests using pytest to check your implementation.
- Create a tarball of your code and share it via email.
Note: If you have used any external libraries, please include them in the requirements.txt file.

