Merged
Changes from all commits
22 commits
73ac7d4
Add new type of tagger
Jun 2, 2025
347ef4e
Prepare to re-add kfold
Jun 3, 2025
2417e49
Load model when server runs, listen for url
cnewman Jun 3, 2025
4dcd8d4
A half vibe coded mess, but I think it works. Needs a ton of clean up.
cnewman Jun 4, 2025
b30e928
Fix bug with the masking
cnewman Jun 4, 2025
a84f3ad
Remove system as a feature
cnewman Jun 4, 2025
ca22c59
Update to pull from huggingface or local based on --local
cnewman Jun 4, 2025
e083b39
Fix requirements and I dunno how the crf imports are working
Jun 4, 2025
2703566
Remove req that won't work on windows
cnewman Jun 4, 2025
e135cd6
Greatly reduce the requirements.txt to just the top level reqs
Jun 4, 2025
68a1466
Merge branch 'distilbert' of github.com:SCANL/scanl_tagger into disti…
Jun 4, 2025
059eeb0
Make it so that classification report gets printed to a file
cnewman Jun 4, 2025
ecc8855
Update readme
Jun 4, 2025
6ea557f
DRY
Jun 5, 2025
dc5c8a4
Remove reliance on NLTK. Does not reduce effectiveness of the model, and
cnewman Jun 8, 2025
f067186
Add current metrics
cnewman Jun 8, 2025
26857a1
Tested tree and lm based run and train. Did some thorough documenting…
Jun 10, 2025
bde70e2
Update readme with new arguments and data
Jun 10, 2025
89748a0
git workflow
Jun 10, 2025
28dd47c
Starting to see if I can get Docker up again. Update requirements with…
Jun 10, 2025
4ed9cd3
add download_files() to lm execution flow for the way we are currentl…
Jun 10, 2025
cd8d94e
Forgot to run process in the bg for github actions
Jun 10, 2025
8 changes: 4 additions & 4 deletions .github/workflows/tests.yml
@@ -2,7 +2,7 @@ name: SCALAR Tagger CI

on:
push:
branches: [ master, develop ]
branches: [ master, develop, distilbert ]
pull_request:
branches: [ master, develop ]

@@ -78,12 +78,12 @@ jobs:

- name: Start tagger server
run: |
./main -r &
python main --mode run --model_type lm_based &

# Wait for up to 5 minutes for the service to start and load models
timeout=300
while [ $timeout -gt 0 ]; do
if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
if curl -s "http://localhost:8080/numberArray/DECLARATION" > /dev/null; then
echo "Service is ready"
break
fi
@@ -101,7 +101,7 @@

- name: Test tagger endpoint
run: |
response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
response=$(curl -s "http://localhost:8080/numberArray/DECLARATION")
if [ -z "$response" ]; then
echo "No response from tagger"
exit 1
9 changes: 1 addition & 8 deletions Dockerfile
@@ -1,18 +1,11 @@
FROM python:3.12-slim

# Argument to enable GPU acceleration
ARG GPU=false

# Install (and build) requirements
COPY requirements.txt /requirements.txt
COPY requirements_gpu.txt /requirements_gpu.txt
RUN apt-get clean && rm -rf /var/lib/apt/lists/* && \
apt-get update --fix-missing && \
apt-get install --allow-unauthenticated -y git curl && \
pip install -r requirements.txt && \
if [ "$GPU" = true ]; then \
pip install -r requirements_gpu.txt; \
fi && \
apt-get clean && rm -rf /var/lib/apt/lists/*

COPY . .
@@ -77,6 +70,6 @@ CMD date; \
fi; \
date; \
echo "Running..."; \
/main -r --words words/abbreviationList.csv
/main --mode train --model_type lm_based --words words/abbreviationList.csv

ENV TZ=US/Michigan
252 changes: 135 additions & 117 deletions README.md
@@ -1,162 +1,180 @@
# SCALAR Part-of-speech tagger
This is the official release of the SCALAR Part-of-speech tagger
# SCALAR Part-of-Speech Tagger for Identifiers

There are two ways to run the tagger. This document describes both ways.
**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:

1. Using Docker compose (which runs the tagger's built-in server for you)
2. Running the tagger's built-in server without Docker
- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate)
- Legacy Gradient Boosting model (for compatibility)

## Current Metrics (this will be updated every time we update/change the model!)
| | Accuracy | Balanced Accuracy | Weighted Recall | Weighted Precision | Weighted F1 | Performance (seconds) |
|------------|:--------:|:------------------:|:---------------:|:------------------:|:-----------:|:---------------------:|
| **SCALAR** | **0.8216** | **0.9160** | **0.8216** | **0.8245** | **0.8220** | **249.05** |
| Ensemble | 0.7124 | 0.8311 | 0.7124 | 0.7597 | 0.7235 | 1149.44 |
| Flair | 0.6087 | 0.7844 | 0.6087 | 0.7755 | 0.6497 | 807.03 |
---

## Getting Started with Docker
## Installation

To run SCALAR in a Docker container, you can clone the repository and pull the latest Docker image from `sourceslicer/scalar_tagger:latest`
Make sure you have `python3.12` installed. Then:

Make sure you have Docker and Docker Compose installed:
```bash
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

---

## Usage

https://docs.docker.com/engine/install/
You can run SCALAR in multiple ways:

https://docs.docker.com/compose/install/
### CLI (with DistilBERT or GradientBoosting model)

```bash
python main --mode run --model_type lm_based # DistilBERT (recommended)
python main --mode run --model_type tree_based # Legacy model
```
git clone [email protected]:SCANL/scanl_tagger.git
cd scanl_tagger
docker compose pull
docker compose up

Then query like:

```
http://127.0.0.1:8080/GetValue/FUNCTION
```

## Getting Started without Docker
You will need `python3.12` installed.
Supports context types:
- FUNCTION
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
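
A minimal sketch of querying the route above from Python, assuming the server was started locally with `python main --mode run` and listens on port 8080 (the response body is simply printed as-is, since its exact format is not specified here):

```python
import urllib.request

# Hypothetical example: tag the identifier "GetValue" used as a function name.
identifier = "GetValue"
context = "FUNCTION"  # one of FUNCTION, CLASS, ATTRIBUTE, DECLARATION, PARAMETER

url = f"http://127.0.0.1:8080/{identifier}/{context}"
with urllib.request.urlopen(url, timeout=30) as response:
    # Print the raw response returned by the tagger.
    print(response.read().decode("utf-8"))
```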

---

You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/
## Training

Set up a virtual environment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
You can retrain either model (default parameters are currently hardcoded):

Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
```bash
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based
```

After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
---

Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
## Evaluation Results

## Usage
### DistilBERT (LM-Based Model) — Recommended

```
usage: main [-h] [-v] [-r] [-t] [-a ADDRESS] [--port PORT] [--protocol PROTOCOL]
[--words WORDS]

options:
-h, --help show this help message and exit
-v, --version print tagger application version
-r, --run run server for part of speech tagging requests
-t, --train run training set to retrain the model
-a ADDRESS, --address ADDRESS
configure server address
--port PORT configure server port
--protocol PROTOCOL configure whether the server uses http or https
--words WORDS provide path to a list of acceptable abbreviations
```
| Metric | Score |
|--------------------------|---------|
| **Macro F1** | 0.9032 |
| **Token Accuracy** | 0.9223 |
| **Identifier Accuracy** | 0.8291 |

`./main -r` will start the server, which will listen for identifier names sent via HTTP over the route:
| Label | Precision | Recall | F1 | Support |
|-------|-----------|--------|-------|---------|
| CJ | 0.88 | 0.88 | 0.88 | 8 |
| D | 0.98 | 0.96 | 0.97 | 52 |
| DT | 0.95 | 0.93 | 0.94 | 45 |
| N | 0.94 | 0.94 | 0.94 | 418 |
| NM | 0.91 | 0.93 | 0.92 | 440 |
| NPL | 0.97 | 0.97 | 0.97 | 79 |
| P | 0.94 | 0.92 | 0.93 | 79 |
| PRE | 0.79 | 0.79 | 0.79 | 68 |
| V | 0.89 | 0.84 | 0.86 | 110 |
| VM | 0.79 | 0.85 | 0.81 | 13 |

http://127.0.0.1:8080/{identifier_name}/{code_context}/{database_name (optional)}
**Inference Performance:**
- Identifiers/sec: 225.8

"database name" specifies an sqlite database to be used for result caching and data collection. If the database specified does not exist, one will be created.
---

You can check whether or not a database exists by using the `/probe` route with an HTTP request like this:
### Gradient Boost Model (Legacy)

http://127.0.0.1:5000/probe/{database_name}
| Metric | Score |
|----------------------|-----------|
| Accuracy | 0.8216 |
| Balanced Accuracy | 0.9160 |
| Weighted Recall | 0.8216 |
| Weighted Precision | 0.8245 |
| Weighted F1 | 0.8220 |
| Inference Time | 249.05s |

"code context" is one of:
- FUNCTION
- ATTRIBUTE
- CLASS
- DECLARATION
- PARAMETER
**Inference Performance:**
- Identifiers/sec: 8.6

For example:
---

Tag a declaration: ``http://127.0.0.1:8000/numberArray/DECLARATION/database``
## Supported Tagset

Tag a function: ``http://127.0.0.1:8000/GetNumberArray/FUNCTION/database``
| Tag | Meaning | Examples |
|-------|------------------------------------|--------------------------------|
| N | Noun | `user`, `Data`, `Array` |
| DT | Determiner | `this`, `that`, `those` |
| CJ | Conjunction | `and`, `or`, `but` |
| P | Preposition | `with`, `for`, `in` |
| NPL | Plural Noun | `elements`, `indices` |
| NM | Noun Modifier (adjective-like) | `max`, `total`, `employee` |
| V | Verb | `get`, `set`, `delete` |
| VM | Verb Modifier (adverb-like) | `quickly`, `deeply` |
| D | Digit | `1`, `2`, `10`, `0xAF` |
| PRE | Preamble / Prefix | `m`, `b`, `GL`, `p` |

Tag a class: ``http://127.0.0.1:8000/PersonRecord/CLASS/database``
---

#### Note
Kebab case is not currently supported due to the limitations of Spiral. Attempting to send the tagger identifiers which are in kebab case will result in the entry of a single noun.
## Docker Support (Legacy only)

You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
For the legacy server, you can also use Docker:

```bash
docker compose pull
docker compose up
```

---

## Tagset
## Notes

**Supported Tagset**
| Abbreviation | Expanded Form | Examples |
|:------------:|:--------------------------------------------:|:--------------------------------------------:|
| N | noun | Disneyland, shoe, faucet, mother |
| DT | determiner | the, this, that, these, those, which |
| CJ | conjunction | and, for, nor, but, or, yet, so |
| P | preposition | behind, in front of, at, under, above |
| NPL | noun plural | Streets, cities, cars, people, lists |
| NM | noun modifier (**noun-adjunct**, adjective) | red, cold, hot, **bit**Set, **employee**Name |
| V | verb | Run, jump, spin, |
| VM | verb modifier (adverb) | Very, loudly, seriously, impatiently |
| D | digit | 1, 2, 10, 4.12, 0xAF |
| PRE | preamble | Gimp, GLEW, GL, G, p, m, b |
- **Kebab case** is not supported (e.g., `do-something-cool`).
- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/), but any AST-based parser works; a hypothetical extraction sketch is shown below.
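
SCALAR does not parse source files itself, so identifiers have to be extracted first. As a purely hypothetical illustration (using Python's standard `ast` module in place of srcML, and assuming a tagger server running locally on port 8080), one could pull function names out of a Python file and tag each one:

```python
import ast
import urllib.request

# Hypothetical sketch: extract function names from a Python source file with
# the standard-library ast module (srcML or any other AST-based parser plays
# the same role for other languages) and tag each name with a local SCALAR server.
with open("example.py") as f:  # assumed input file
    tree = ast.parse(f.read())

for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        url = f"http://127.0.0.1:8080/{node.name}/FUNCTION"
        with urllib.request.urlopen(url, timeout=30) as resp:
            print(node.name, "->", resp.read().decode("utf-8"))
```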

**Penn Treebank to SCALAR tagset**
---

| Penn Treebank Annotation | SCALAR Tagset |
|:---------------------------:|:------------------------:|
| Conjunction (CC) | Conjunction (CJ) |
| Digit (CD) | Digit (D) |
| Determiner (DT) | Determiner (DT) |
| Foreign Word (FW) | Noun (N) |
| Preposition (IN) | Preposition (P) |
| Adjective (JJ) | Noun Modifier (NM) |
| Comparative Adjective (JJR) | Noun Modifier (NM) |
| Superlative Adjective (JJS) | Noun Modifier (NM) |
| List Item (LS) | Noun (N) |
| Modal (MD) | Verb (V) |
| Noun Singular (NN) | Noun (N) |
| Proper Noun (NNP) | Noun (N) |
| Proper Noun Plural (NNPS) | Noun Plural (NPL) |
| Noun Plural (NNS) | Noun Plural (NPL) |
| Adverb (RB) | Verb Modifier (VM) |
| Comparative Adverb (RBR) | Verb Modifier (VM) |
| Particle (RP) | Verb Modifier (VM) |
| Symbol (SYM) | Noun (N) |
| To Preposition (TO) | Preposition (P) |
| Verb (VB) | Verb (V) |
| Verb (VBD) | Verb (V) |
| Verb (VBG) | Verb (V) |
| Verb (VBN) | Verb (V) |
| Verb (VBP) | Verb (V) |
| Verb (VBZ) | Verb (V) |
## Citations

Please cite:

```
@inproceedings{newman2025scalar,
author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
title = {SCALAR: A Part-of-speech Tagger for Identifiers},
booktitle = {ICPC Tool Demonstrations Track},
year = {2025}
}

@article{newman2021ensemble,
title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
journal={IEEE Transactions on Software Engineering},
year={2021},
doi={10.1109/TSE.2021.3098242}
}
```

## Training the tagger
You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
---

## Errors?
Please make an issue if you run into errors
## Training Data

# Please Cite the Paper(s)!
You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv)
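
A minimal sketch for inspecting the dataset before retraining, assuming a repository checkout (the file lives at `input/tagger_data.tsv`; no column names are assumed):

```python
import pandas as pd

# Load the SCALAR training data as a tab-separated file and inspect its
# shape and columns before retraining. Column names are not assumed here.
df = pd.read_csv("input/tagger_data.tsv", sep="\t")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```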

Newman, Christian, Scholten , Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), “SCALAR: A Part-of-speech Tagger for Identifiers”, in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27 -28, 5 pages TO APPEAR.
---

Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
## More from SCANL

# Training set
The data used to train this tagger can be found in the most recent database update in the repo -- https://github.com/SCANL/scanl_tagger/blob/master/input/scanl_tagger_training_db_11_29_2024.db
- [SCANL Website](https://www.scanl.org/)
- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)

# Interested in our other work?
Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
---

# WordNet
This project uses WordNet to perform a dictionary lookup on the individual words in each identifier:
## Trouble?

Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010
Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!