Extract text from paper PDFs and abstracts, and remove uninformative words. This is helpful for building a corpus of papers to train a language model.
Based on CVPR_paper_search_tool by Jin Yamanaka. I decided to split the code into multiple projects:
- AI Papers Scrapper - Download paper PDFs and other information from the main AI conferences
- this project - Extract text from paper PDFs and abstracts, and remove uninformative words
- AI Papers Search Tool - Automatic paper clustering
- AI Papers Searcher - Web app to search papers by keywords or similar papers
- AI Conferences Info - Contains the titles, abstracts, URLs, and authors' names extracted from the papers
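For intuition, the word-cleaning step boils down to something like the following minimal sketch, which uses nltk's English stopword list (this illustrates the idea only; it is not the project's actual pipeline):

```python
from nltk.corpus import stopwords  # needs nltk.download('stopwords'), see below

def clean_text(text: str) -> str:
    """Lowercase, split on whitespace, and drop uninformative words."""
    stop_words = set(stopwords.words('english'))
    return ' '.join(t for t in text.lower().split() if t not in stop_words)

print(clean_text("We propose a novel method for the detection of objects"))
# -> propose novel method detection objects
```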
I recently changed the data format (most of it is now tsv, since that avoids some errors with the abstracts and titles of papers). If you have data in the old format (csv files with ; or | as separators) and want to convert it to the new format, check the migration section below.
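As a made-up example of why the change helps: in the old ;-separated files, a title that itself contains a semicolon splits into too many fields, while a tab separator stays unambiguous (\t denotes the tab character):

```
To Attend; or Not to Attend;John Smith     <- old format: 3 fields instead of 2
To Attend; or Not to Attend\tJohn Smith    <- new format: 2 fields
```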
## Requirements

Docker or, for local installation:
- Python 3.11+
- Poetry
## Usage

To make it easier to run the code, with or without Docker, I created a few helpers. Both ways use start_here.sh as an entry point. Since there are a few quirks when calling the specific code, this file gathers all the commands needed to run it. All you need to do is uncomment the relevant lines inside the conferences array and run the script. Also, comment/uncomment the following flags as needed:
```sh
extract_pdfs=1
extract_urls=1
clean_abstracts=1
clean_papers=1
```
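For reference, the conferences array inside start_here.sh looks something like this (the identifiers below are illustrative assumptions; check the script itself for the real list):

```sh
# uncomment the conferences you want to process
conferences=(
    # "cvpr"
    "aaai"
    # "nips"
)
```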
You'll need to download some nltk data. To do this, read the relevant section below according to how you run the code.

### Running without Docker

You first need to install Python Poetry. Then, you can install the dependencies and run the code:
```sh
poetry install
bash start_here.sh
```

To download the nltk data, run the following:
```sh
poetry run ipython3
```

Then, inside the Python shell:
```python
import nltk
nltk.download('stopwords')
```

### Running with Docker

To help with the Docker setup, I created a Dockerfile and a Makefile. The Dockerfile contains all the instructions to create the Docker image, while the Makefile contains the commands to build the image, run the container, and run the code inside the container.
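Roughly, the Makefile targets wrap docker commands along these lines (the image name and flags here are assumptions for illustration; check the Makefile for the real commands):

```sh
docker build -t paper-cleaner .                       # make
docker run --rm -it paper-cleaner bash start_here.sh  # make run
docker run --rm -it paper-cleaner ipython3            # make RUN_STRING="ipython3" run
```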
To build the image, simply run:

```sh
make
```

To call start_here.sh inside the container, run:
```sh
make run
```

To download the nltk data, run the following:
```sh
make RUN_STRING="ipython3" run
```

Then, inside the Python shell:
```python
import nltk
nltk.download('stopwords')
```

## Checking the cleaning process for a specific paper

The best way to check how the cleaning process works for a specific paper is by running the clean_paper.sh script. You can set the following variables inside it:
```sh
# clean_abstracts=1
clean_papers=1
index=1
# title="Moon IME: Neural-based Chinese Pinyin Aided Input Method with Customizable Association"
conf=aaai
year=2017
```

To check the abstract cleaning process, uncomment the clean_abstracts line and comment the clean_papers line; to check the paper cleaning process, reverse the comments. You need to set the conf and year variables to the conference (as displayed in the conferences array in start_here.sh) and year of your choice, and set one of the index or title variables. The index variable is the index of the paper in the abstracts.tsv or pdfs.csv file, while title can be a part of the paper's title. If you set both, the index variable will be used.
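For example, to inspect the abstract cleaning for a paper looked up by a fragment of its title, the variables might be set like this (the values below are illustrative, not a real paper reference):

```sh
clean_abstracts=1
# clean_papers=1
# index=1
title="Pinyin Aided Input Method"  # any fragment of the title works
conf=aaai
year=2017
```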
To call the clean_paper.sh script, run:

```sh
bash clean_paper.sh                        # if you're running without Docker
make RUN_STRING="bash clean_paper.sh" run  # if you're running with Docker
```

## Migrating data from the old format

Recently I changed how the data is stored after scraping in the scraping project, replacing all separators (| for abstracts, ; for the rest of the files) with tabs (\t) and changing the file extension to .tsv. This was done to avoid conflicts when these symbols appeared inside an abstract or title, and even in cases where author lists were parsed with ; instead of , between the names. To migrate your data to the new format, simply create a bash script in your data directory with the following content and run it:
```sh
#!/bin/bash
find . -type f -name '*.csv' -print0 | sort -z | while IFS= read -r -d '' file; do
    if [[ $(head -n1 "$file") == *"|"* ]]; then
        # file uses | as separator: replace the first occurrence of | in every line with \t
        echo "Processing $file"
        sed -i 's/|/\t/' "$file"
        mv "$file" "$(dirname "$file")/$(basename "$file" .csv).tsv"
    elif [[ $(head -n1 "$file") == *";"* ]]; then
        # file uses ; as separator: replace all occurrences of ; in every line with \t
        echo "Processing $file"
        sed -i 's/;/\t/g' "$file"
        mv "$file" "$(dirname "$file")/$(basename "$file" .csv).tsv"
    fi
done
```
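Note that the script assumes GNU sed (the default on Linux, including typical Docker images): BSD/macOS sed requires `sed -i ''` for in-place editing and doesn't expand \t in the replacement, so on macOS run the script inside the container or with gsed instead.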