Advanced Data Analysis on Research Publications: Grouping Papers Using a Knowledge Graph
The objective of this project is advanced data analysis on research publications. Given a corpus of 30 papers, the papers have been grouped according to their topics and similarities, and a knowledge graph has been constructed from those relationships. Additionally, the knowledge graph has been enriched with metadata from sources such as OpenAlex and OpenAIRE.
You can find the documentation here.
To run this program you will need:
- Docker, which provides a convenient way to package, distribute, and run applications as containers, ensuring consistency across different environments.
- An operating system such as Windows or Linux.
Step 1: Clone the repository from GitHub to your local machine:
git clone https://github.com/RubenCid35/GraphPaperSim.git && cd GraphPaperSim
Step 2: Start the Docker server. On Windows, you can start it from Docker Desktop or from Services.
To execute the project, you only need to run the programs located in the app folder. Inside it, you will find the RDF generated for the 30 articles in Turtle format (output.ttl).
To execute, follow these steps:
Step 1: Start the docker server.
Step 2: Go to the "app" folder.
cd app
Step 3: Execute the following command.
docker-compose up
The program will be available at http://127.0.0.1:8050/.
To close the program, execute:
docker-compose down
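Optionally, you can check that the application is up before opening it in a browser. A minimal Python sketch (assumes the requests package is installed; port 8050 as stated above):

import requests

# Quick sanity check that the app started by docker-compose is reachable.
resp = requests.get("http://127.0.0.1:8050/", timeout=5)
print("App is up" if resp.ok else f"Unexpected status: {resp.status_code}")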
Below, we explain how the RDF of the 30 papers used to create the program was obtained. (Executing these steps is not necessary to run the previous program.)
Step 1: First, Grobid is used to obtain information (title, abstract, acknowledgements) from the 30 papers. To do this, Grobid needs to be running:
docker pull lfoppiano/grobid:0.7.2
docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.2
Then, the code/grobid.py file is executed to obtain the results in results/results.json.
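Presumably, code/grobid.py calls the Grobid REST service started above. A minimal sketch of such a call (the input path is hypothetical, and the actual script may parse the response differently):

import requests

# Send one PDF to the running Grobid service; the response is TEI XML
# containing the title, abstract and acknowledgements, among other fields.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"
with open("papers/example.pdf", "rb") as pdf:  # hypothetical input path
    response = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
response.raise_for_status()
tei_xml = response.text  # parse this TEI XML to extract the needed fields
print(tei_xml[:300])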
Step 2: Next, topic modeling, clustering, NER, and metadata retrieval (via OpenAlex and OpenAIRE) are performed. To do this, the following programs are executed:
- code/topic.py to obtain the existing topics (results/topic.json) and the probability of each paper belonging to each topic (results/topic_prob.json).
- code/similarity.py to obtain the similarity between papers (results/similarity_results.json).
- code/acknowledgment.py for the NER model of acknowledgments (results/acknowledgment.json).
- code/openalex_openaire.py for extracting external information about the papers (results/papers_info.json, results/authors_info.json and results/institutions_info.json); a minimal sketch of this kind of lookup is shown after this list.
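For illustration only, the OpenAlex lookup might look roughly like the following Python sketch (an assumption: code/openalex_openaire.py may use different filters and fields, and the title shown is just an example):

import requests

# Look up one paper's metadata on OpenAlex via a title search.
resp = requests.get(
    "https://api.openalex.org/works",
    params={"search": "Attention Is All You Need", "per-page": 1},
    timeout=30,
)
resp.raise_for_status()
results = resp.json()["results"]
if results:
    work = results[0]
    print(work["id"], work["display_name"], work["publication_year"])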
Step 3: With the aforementioned JSON files, the output.ttl file has been obtained with RML Mapper. This tool executes RML rules (stored in the files under the mappings folder) to generate Linked Data. To use the tool with the previous results, first download the tool's .jar from its releases section and execute the following command:
java -jar .\rmlmapper-6.5.1-r371-all.jar -m .\mappings\transformations.ttl -o app/output.ttl -s turtle

In the application, there are two tabs: "Consulta-Query" (for querying the Knowledge Graph) and "Sobre Nosotros" (About Us).
Below are two example queries. The first retrieves the names stored in the graph:
PREFIX onto: <http://upm.ontology.es/papers#>
SELECT ?sub ?obj WHERE {
?sub onto:hasName ?obj .
} LIMIT 10
The second query retrieves each paper's title, its topic assignment and score, and the topic's words:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX onto: <http://upm.ontology.es/papers#>
SELECT ?name ?uri ?topic ?score ?words
WHERE {
?uri onto:hasTitle ?name;
onto:hasTopic ?topicAssign.
?topicAssign onto:score ?score;
onto:assign ?topic.
?topic onto:hasWords ?words.
FILTER( ?score < 0.9)
} LIMIT 10
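These queries can also be run programmatically against the generated graph. A minimal sketch using rdflib (an assumption: the application itself may execute queries differently, e.g. through its own query tab):

from rdflib import Graph

# Load the generated Knowledge Graph and run the first sample query above.
g = Graph()
g.parse("app/output.ttl", format="turtle")
query = """
PREFIX onto: <http://upm.ontology.es/papers#>
SELECT ?sub ?obj WHERE { ?sub onto:hasName ?obj . } LIMIT 10
"""
for sub, obj in g.query(query):
    print(sub, obj)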
Authors:
- Yimin Zhou
- Rubén Cid Costa
- Rodrigo Durán Andrés

Contact:
- Rubén Cid Costa: [email protected]
- Rodrigo Durán Andrés: [email protected]
- Yimin Zhou: [email protected]
