Fetching and Processing Emails
The extraction.py tool fetches all of the user's sent emails, converts them into the desired format, and saves them. Each email is saved in a sentence-per-line format, which makes it easier to work with individual sentences later during clustering.
Usage:
$ python extraction.py -h
usage: extraction.py [-h] --output OUTPUT [--reload] [--info] [--sentence]
Tool for extracting sent emails from a user's account
optional arguments:
-h, --help show this help message and exit
required arguments:
--output OUTPUT Output directory
optional arguments:
--reload If true, remove any existing account.
--info If true, create an info file containing the headers.
--sentence If true, save each sentence of the emails in separate
files.
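For readers who want to see how such an interface could be declared, here is a minimal argparse sketch that mirrors the usage text above. It is an illustration only, not the tool's actual source: the function name build_parser is hypothetical, and treating --reload, --info and --sentence as boolean switches is an assumption based on the usage line showing them without a value.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose interface mirrors the usage text (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="extraction.py",
        description="Tool for extracting sent emails from a user's account",
    )

    # --output is mandatory, so it gets its own titled group.
    required = parser.add_argument_group("required arguments")
    required.add_argument("--output", required=True, help="Output directory")

    # Assumption: the three switches take no value and default to False.
    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("--reload", action="store_true",
                          help="If true, remove any existing account.")
    optional.add_argument("--info", action="store_true",
                          help="If true, create an info file containing the headers.")
    optional.add_argument("--sentence", action="store_true",
                          help="If true, save each sentence of the emails in separate files.")
    return parser
```

With a parser like this, argparse also accepts unambiguous prefixes of long options by default, so --out is read as --output.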
A token.pickle file is created automatically the first time the authorization flow completes. To fetch all sent emails from a new email account and save them in the emails directory, pass the --reload flag, which removes any existing account:
$ python extraction.py --out emails --reload
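The token caching described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's code: load_credentials and its authorize callback are hypothetical names, the callback stands in for whatever OAuth flow the tool actually runs against Google, and token expiry or refresh is ignored.

```python
import os
import pickle
from typing import Any, Callable


def load_credentials(token_path: str, reload: bool,
                     authorize: Callable[[], Any]) -> Any:
    """Return cached credentials, calling `authorize` only when needed.

    `authorize` is a hypothetical stand-in for the real authorization flow.
    """
    # --reload: forget the previously authorized account.
    if reload and os.path.exists(token_path):
        os.remove(token_path)

    # Reuse the cached token when one is present.
    if os.path.exists(token_path):
        with open(token_path, "rb") as fh:
            return pickle.load(fh)

    # First run (or after --reload): authorize and cache the result.
    creds = authorize()
    with open(token_path, "wb") as fh:
        pickle.dump(creds, fh)
    return creds
```

Passing the flow as a callback keeps the caching logic independent of any particular client library, which is convenient for testing.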
The tool connects to an email account through the Gmail API, which provides flexible RESTful access to the emails of a Gmail account. As a result, only Gmail accounts are supported, although the tool could be extended to other email providers.
After fetching, the body of each email contains a lot of unwanted content that has to be removed. The clean body should contain only Greek words, since it will be used as input to the language model tool. To achieve this, we use:
- BeautifulSoup library to remove all HTML markup.
- num2words library to convert numbers to English words. The words are then translated into Greek using convert_num.py.
- alphabet-detector library to detect and keep Greek words.
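To make two of these steps concrete, the sketch below strips HTML and keeps only Greek words. It deliberately uses the Python standard library as a stand-in: html.parser replaces BeautifulSoup and a unicodedata check replaces alphabet-detector, so it is not the tool's implementation. The function names are hypothetical, and the number conversion through num2words and convert_num.py is not reproduced.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html(html: str) -> str:
    """Return the text content of `html` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


def is_greek_word(word: str) -> bool:
    """True if the word contains letters and all of them are Greek."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("GREEK") for ch in letters
    )


def keep_greek(text: str) -> str:
    """Drop every whitespace-separated token that is not a Greek word."""
    return " ".join(w for w in text.split() if is_greek_word(w))
```

For example, keep_greek(strip_html("<p>Καλησπέρα <b>σας</b>, Machine Learning</p>")) keeps only the two Greek words.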
Also, some emails contain the whole history of the conversation between sender and receiver. Since we need only the newly sent email, the earlier messages are removed. Finally, we remove all punctuation and non-alphabetic characters and convert all characters to lowercase. An example follows:
Before:
Καλησπέρα σας,
Θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα Machine Learning με κωδικό 12345.
--
Αντωνιάδης Παναγιώτης
After:
['καλησπέρα σας', 'θα ήθελα να ρωτήσω πόσο πήρα στο μάθημα με κωδικό δώδεκα χιλιάδες τριακόσια σαράντα πέντε ']
We can see that the signature has been removed, the English course title has been dropped, and the course code has been converted into Greek words. Note that the salutation καλησπέρα σας is treated as a separate sentence for semantic reasons.
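The remaining clean-up (dropping the signature and quoted history, splitting into sentences, removing punctuation, lowercasing) might look roughly like the sketch below. Several details are assumptions, since the tool's actual rules are not described: a line containing only -- is taken to start the signature, lines beginning with > are taken to be quoted history, and sentences are split on line breaks and on ., ; and !. The function names are hypothetical, and the input is assumed to have already gone through number conversion and Greek-word filtering.

```python
import re
from typing import List


def drop_signature_and_history(body: str) -> str:
    """Remove the signature and quoted earlier messages (assumed markers)."""
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":
            break  # assumed signature delimiter: ignore everything after it
        if line.lstrip().startswith(">"):
            continue  # assumed quoting style for earlier messages
        kept.append(line)
    return "\n".join(kept)


def normalize(sentence: str) -> str:
    """Lowercase, replace non-alphabetic characters with spaces, collapse spaces."""
    letters_only = "".join(ch if ch.isalpha() else " " for ch in sentence.lower())
    return " ".join(letters_only.split())


def to_sentences(body: str) -> List[str]:
    """Split a raw body into clean, non-empty sentences."""
    text = drop_signature_and_history(body)
    # Line breaks also end a sentence, so a salutation on its own line
    # becomes a separate sentence.
    parts = re.split(r"[.;!\n]+", text)
    return [s for s in (normalize(p) for p in parts) if s]
```

Splitting on line breaks is one simple way to obtain the behaviour seen in the example, where the salutation ends up as its own sentence.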
Finally, each clean email is saved in the output directory as email_{id} (one sentence per line). In addition, passing the --info flag saves an info file that contains the headers of the emails in the following format:
sender | receiver | subject
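A minimal sketch of this saving step is given below, assuming UTF-8 text files. The helper names save_email and save_info are hypothetical, and so is the default file name info, since the name of the info file is not stated here.

```python
import os
from typing import Iterable, List, Tuple


def save_email(out_dir: str, email_id: str, sentences: List[str]) -> str:
    """Write one sentence per line to <out_dir>/email_<id> and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"email_{email_id}")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(sentences) + "\n")
    return path


def save_info(out_dir: str, headers: Iterable[Tuple[str, str, str]],
              filename: str = "info") -> str:
    """Write one 'sender | receiver | subject' line per email.

    The file name 'info' is an assumption made for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as fh:
        for sender, receiver, subject in headers:
            fh.write(f"{sender} | {receiver} | {subject}\n")
    return path
```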