A study project for extracting speaker data from British parliamentary debates on Yemen and representing it in a Neo4j database. All scripts are written for Python 3.12. Required dependencies are listed in the requirements.txt file.
For our project we used the ParlaMint-GB corpus, a linguistically annotated corpus of British parliamentary speeches from 2015 to 2022. It is encoded in TEI XML and consists of three file types: debate files, a speaker metadata file and a party metadata file.
Due to its size, the raw corpus is not included in this repository. To run the scripts, download the ParlaMint-GB corpus and save it in data/raw. Then run the scripts in the order in which they are discussed in the following sections.
The data directory contains the extracted XML files and corresponding Cypher queries for each debate. The scripts directory contains all scripts used for the project.
The yemen_debates.py script extracts all debate files from the corpus that contain discussions about Yemen.
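A minimal sketch of how such a filter might work. It assumes that a debate is relevant as soon as the word "Yemen" occurs anywhere in its text content; the criterion yemen_debates.py actually uses may differ. The function name mentions_yemen and the two sample documents are hypothetical and far simpler than real ParlaMint-GB files.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a debate is relevant if "Yemen" (or "Yemeni", "Yemenis") occurs in its text.
YEMEN_PATTERN = re.compile(r"\bYemen", re.IGNORECASE)


def mentions_yemen(xml_text: str) -> bool:
    """Return True if any text node of the XML document mentions Yemen."""
    root = ET.fromstring(xml_text)
    # itertext() yields the text of every element and skips tags and attributes
    return any(YEMEN_PATTERN.search(chunk) for chunk in root.itertext())


# Hypothetical, heavily simplified TEI-like samples
sample_hit = (
    "<TEI><text><u who='#A'><seg>The situation in Yemen is grave.</seg></u></text></TEI>"
)
sample_miss = (
    "<TEI><text><u who='#B'><seg>We turn to transport policy.</seg></u></text></TEI>"
)

print(mentions_yemen(sample_hit))
print(mentions_yemen(sample_miss))
```

To apply this to the corpus, one could read each file under data/raw and pass its content to mentions_yemen, copying the matching files to a separate directory.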
The XSLT_processor.py script applies the XSL transformation defined in XSLT.xsl to the Yemen debate files and transforms them into structured XML files for further processing.
The queries.py script creates Cypher query files based on the resulting XML files from section 2. The database.py script creates the database entries by executing those query files.
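The query generation step can be sketched as follows. This is an illustration under stated assumptions, not the logic of queries.py: the element and attribute names (debate, speaker, id, name), the node labels Debate and Speaker, and the SPOKE_IN relation are all hypothetical, since the layout of the structured XML files is not described here.

```python
import xml.etree.ElementTree as ET


def cypher_string(value: str) -> str:
    """Quote a Python string as a Cypher string literal."""
    # Escape backslashes first, then single quotes
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def debate_to_cypher(xml_text: str) -> list[str]:
    """Build one Cypher statement for the debate and one per speaker."""
    root = ET.fromstring(xml_text)
    debate_id = cypher_string(root.get("id", ""))
    # MERGE creates the node only if it does not exist yet
    statements = [f"MERGE (d:Debate {{id: {debate_id}}});"]
    for speaker in root.iter("speaker"):
        name = cypher_string(speaker.get("name", ""))
        statements.append(
            f"MATCH (d:Debate {{id: {debate_id}}}) "
            f"MERGE (s:Speaker {{name: {name}}}) "
            f"MERGE (s)-[:SPOKE_IN]->(d);"
        )
    return statements


# Hypothetical structured XML as it might result from the XSL transformation
sample = "<debate id='d1'><speaker name='A. Example'/><speaker name='B. Example'/></debate>"
for statement in debate_to_cypher(sample):
    print(statement)
```

Using MERGE rather than CREATE keeps the statements idempotent, so a speaker who appears in several debates ends up as a single node. When queries are sent through a database driver, passing values as query parameters is generally preferable to string interpolation.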
Finally, the speaker2speaker_projection.cypher file contains a Cypher query for creating and returning a speaker-to-speaker projection of the bipartite graph. This is done by creating a new edge relation between speakers who participated in the same debate. Then only the speaker nodes with this CO_DEBATED_WITH relation are retrieved.
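The idea behind the projection can be illustrated in plain Python, independently of Neo4j. The sketch below is not the Cypher query from the repository: the function name project_speakers and the sample data are hypothetical. It connects every pair of speakers who share a debate and then keeps only the speakers that have at least one such connection.

```python
from itertools import combinations


def project_speakers(debates: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Project a debate-to-speakers mapping onto speaker-to-speaker edges."""
    edges = set()
    for speakers in debates.values():
        # Sorting gives each undirected edge one canonical orientation
        for first, second in combinations(sorted(speakers), 2):
            edges.add((first, second))
    return edges


# Hypothetical bipartite graph: debate id -> speakers who took part
debates = {
    "debate_1": {"Alice", "Bob", "Carol"},
    "debate_2": {"Bob", "Dan"},
    "debate_3": {"Eve"},
}

edges = project_speakers(debates)
# Only speakers with at least one co-debate edge remain in the projection
connected = {speaker for edge in edges for speaker in edge}

print(sorted(edges))
print(sorted(connected))
```

In this example Eve spoke alone in her debate, so she has no co-debate edge and does not appear in the projected graph.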