This repository contains the code, preprocessed datasets, and intermediate data of the paper:
We collected four datasets from GitHub. The keyword "Cloud Native Computing Foundation" is used to collect repositories for the CNCF dataset. Similarly, the keywords "java" and "IBM" are used to collect repositories for the Java dataset, and the keywords "machine learning", "data mining", and "IBM" for the machine learning (ML) dataset. Finally, the search terms "network embedding", "graph neural networks", and "graph convolutional network" are used to build the network embedding (NE) dataset. For each dataset, the GitHub description and the content of the README file of each repository are combined and used as its description after text pre-processing: removal of stop words, symbols, numbers, links, and HTML tags. Furthermore, LDA topic modeling is applied to the repository descriptions to obtain LDA topics.
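The pre-processing steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repository's actual pipeline: the function name `clean_description` and the tiny `STOP_WORDS` set are assumptions made for the example, and the real code presumably relies on the listed packages (for instance nltk for stop words and beautifulsoup4 for HTML) instead.

```python
import html
import re

# Illustrative stop-word subset (assumption); a full list such as the
# nltk stopwords corpus would normally be used instead.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def clean_description(text: str) -> str:
    """Remove HTML tags, links, symbols, numbers, and stop words from text."""
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip links
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # strip symbols and numbers
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "<p>Visit https://example.com for 3 graph neural networks &amp; more!</p>"
    print(clean_description(sample))
```

The cleaned string is what a topic model such as LDA would then take as input, one string per repository.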
To run the code, the following packages and APIs are required: numpy 1.18.5, tensorflow, networkx 2.4, pickle5, simplejson, scikit-learn, matplotlib, beautifulsoup4, lxml, nltk 3.5, HTMLParser
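Since some of the packages are pinned to specific versions, it may help to check an environment before running the code. The sketch below is one possible way to do that, assuming Python 3.8 or later (for `importlib.metadata`); the `PINNED` dictionary and the `check_pins` helper are hypothetical names introduced for this example and cover only the packages for which a version is stated above.

```python
from importlib import metadata

# Versions stated in the requirements above; packages without a stated
# version are left out of this check.
PINNED = {"numpy": "1.18.5", "networkx": "2.4", "nltk": "3.5"}


def check_pins(pins=PINNED):
    """Return {package: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_pins().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```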