This code can be used to run the protest event detection models that were trained as part of my master's thesis. There are five different models:
- Haystack classifier: Predict whether a given text is about a protest (binary classification).
- Haystack and form classifier: Predict whether a given text is about a protest, and the form of the protest (multiclass classification).
- Haystack and issue classifier: Predict whether a given text is about a protest, and the issue of the protest (multiclass classification).
- Haystack and target classifier: Predict whether a given text is about a protest, and the target of the protest (multiclass classification).
- Full multitask classifier: Jointly predict all of the above tasks.
Haystack classes
- Non-protest
- Protest
Form classes
- Blockade/slowdown/disruption
- Boycott
- Hunger strike
- March
- Non-protest
- Rally/demonstration
- Riot
- Strike/walkout/lockout
Issue classes
- Anti-colonial/political independence
- Anti-war/peace
- Criminal justice system
- Democratisation
- Economy/inequality
- Environmental
- Foreign policy
- Human and civil rights
- Labour & work
- Non-protest
- Political corruption/malfeasance
- Racial/ethnic rights
- Religion
- Social services & welfare
- None of the above
Target classes
- Domestic government
- Foreign government
- Individual
- Intergovernmental organisation
- Non-protest
- Private/business
Running the Python (3.7+) script requires the following packages:
- PyTorch 1.4 (works with and without CUDA)
- huggingface transformers
- tqdm
- numpy
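Before running the script, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of this repository: the helper names `missing_packages` and `cuda_available` are hypothetical, and the list assumes the usual import names (`torch` for PyTorch, `transformers` for huggingface transformers).

```python
import importlib.util

# Import names of the packages listed above (assumed: "torch" is PyTorch's import name).
REQUIRED = ["torch", "transformers", "tqdm", "numpy"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
    print("CUDA available:", cuda_available())
```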
The models can be downloaded from here (approx. 4.2 GB). The file must be extracted into the src/ directory, and this can be done using e.g. 7-zip on Windows or with tar on Linux/Mac.
Using tar, the command is:
tar -xzvf models.tar.gz -C path/to/protest-event-detection/src/
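If neither 7-zip nor tar is at hand, the same extraction can be done with Python's standard `tarfile` module. This is a sketch under the assumption that the download is a gzip-compressed tar archive named models.tar.gz, as in the command above; the function name `extract_models` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_models(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression.
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Example call, mirroring the tar command above:
# extract_models("models.tar.gz", "path/to/protest-event-detection/src/")
```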
The main script to run is classify_article.py. It has several parameters:
- --article: Path to a text file with the raw article text to make predictions on. Articles work best when they are no longer than about 350 words.
- --output_path (optional): Path to where the prediction file will be stored. By default, the script prints predictions to the terminal and does not write them to a file.
- --out_file (optional): Name of the output file in which the prediction will be stored, if --output_path is set. Default name: pred.txt.
- --task: Which prediction model to use. Possible values are haystack, form, issue, target and multi.
- --mc_samples (optional): Number of times to make predictions on the same article. A non-zero value puts the model in training mode, so that its predictions are stochastic; the final prediction is then the average over all predictions made. This gives additional output (uncertainty estimates). If not set, the model makes a single prediction in evaluation mode. Recommended value, if not zero: 50. Default value: 0.
- --gpu_devices (optional): If CUDA is available, the script uses the GPU(s). This option selects which GPU to use, e.g. --gpu_devices 1 to use the second GPU in a system with multiple GPUs. With CUDA available, a single GPU is more than enough to classify one article, so it is normally not necessary to change this option. Default value: 0 (the first GPU).
Note: Depending on your setup, running predictions without CUDA/GPU can be very slow compared to running on a GPU.
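To illustrate the idea behind the --mc_samples option described above, here is a toy sketch of averaging several stochastic forward passes. It is not the code used by classify_article.py: it assumes the stochasticity comes from dropout staying active in training mode, and the tiny linear classifier, its weights, and the names `stochastic_forward` and `mc_predict` are all made up for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy linear classifier with dropout left active."""
    keep = 1.0 - drop_prob
    # Inverted dropout: zero each feature with probability drop_prob, rescale the rest.
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_predict(features, weights, n_samples=50, drop_prob=0.1, seed=0):
    """Average n_samples stochastic passes; return (mean, std) per class."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_prob, rng)
               for _ in range(n_samples)]
    n_classes = len(weights)
    mean = [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]
    std = [math.sqrt(sum((s[c] - mean[c]) ** 2 for s in samples) / n_samples)
           for c in range(n_classes)]
    return mean, std


if __name__ == "__main__":
    toy_weights = [[0.5, -1.0, 0.2], [-0.3, 0.8, 0.4]]  # 2 classes x 3 features
    mean, std = mc_predict([1.0, 2.0, 0.5], toy_weights)
    print("mean probabilities:", mean)
    print("std per class:", std)
```

The per-class standard deviation is one simple uncertainty estimate: the more the individual passes disagree, the larger it becomes.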
Make a single haystack prediction without uncertainty estimates:
python classify_article.py --article ../path/to/article.txt --task haystack
Make 50 Monte Carlo haystack predictions and output to file:
python classify_article.py --article ../path/to/article.txt --mc_samples 50 --task haystack --output_path ../some/path/
Make 50 Monte Carlo haystack and form predictions and output to a file with a specific name:
python classify_article.py --article ../path/to/article.txt --mc_samples 50 --task form --output_path ../some/path/ --out_file prediction.txt
Make 50 Monte Carlo predictions for all tasks and output to file:
python classify_article.py --article ../path/to/article.txt --mc_samples 50 --task multi --output_path ../some/path/
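Since the script takes one article per run, classifying a folder of articles means calling it repeatedly. The sketch below shows one way to do that with `subprocess`, using only the parameters documented above. The helper names `build_command` and `classify_directory` and the `<name>_pred.txt` naming scheme are hypothetical, and it assumes the code is run from the directory containing classify_article.py.

```python
import subprocess
import sys
from pathlib import Path


def build_command(article, task, mc_samples=0, output_path=None, out_file=None):
    """Build the argument list for one classify_article.py run."""
    cmd = [sys.executable, "classify_article.py",
           "--article", str(article), "--task", task]
    if mc_samples:
        cmd += ["--mc_samples", str(mc_samples)]
    if output_path is not None:
        cmd += ["--output_path", str(output_path)]
        # --out_file only has an effect when --output_path is set.
        if out_file is not None:
            cmd += ["--out_file", out_file]
    return cmd


def classify_directory(article_dir, task, output_path, mc_samples=0):
    """Run the script once per .txt file, writing one prediction file per article."""
    for article in sorted(Path(article_dir).glob("*.txt")):
        out_file = f"{article.stem}_pred.txt"
        cmd = build_command(article, task, mc_samples, output_path, out_file)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Example call with hypothetical paths:
# classify_directory("../articles", "multi", "../predictions/", mc_samples=50)
```

Each run starts a new Python process and loads the model again, so this is convenient rather than fast.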