59 commits
087c498
Merge pull request #9 from SCANL/develop
cnewman Jan 17, 2025
259dbb3
add split_by_capitals
SyreenBan Feb 6, 2025
a8b4ba3
fix the Splitter
SyreenBan Feb 13, 2025
ee69358
Merge pull request #10 from SCANL/develop
cnewman Feb 16, 2025
d91ed0d
Change ports in the readme
cnewman Feb 16, 2025
cfea42d
Forgot Git uses master and not main. Updated .yml to master.
cnewman Feb 16, 2025
d730df7
Change dockerhub link for now
cnewman Feb 16, 2025
40a72da
Rewrite AppCache to use sqlite
brandonscholten Mar 16, 2025
fb2ab83
Switch to sqlite
brandonscholten Mar 17, 2025
6c36a6a
Finish initial sqlite implementation
brandonscholten Mar 21, 2025
38cbb03
Remove sqlite3 from requirements
brandonscholten Mar 21, 2025
4596af0
Fix bugs
brandonscholten Mar 23, 2025
455d5b5
Add restart always to compose.yml
brandonscholten Mar 23, 2025
bec20b9
Attempt at optional cache, broke everything
brandonscholten Mar 23, 2025
f982cf4
Fix count
Mar 26, 2025
9e62aa7
Fix encounter
Mar 26, 2025
bf8d0e5
Remove use of CacheIndex, add probe route
Mar 29, 2025
503d697
Merge pull request #14 from brandonscholten/5-optional-caching
brandonscholten Apr 17, 2025
0e5df4e
Update documentation
brandonscholten Apr 21, 2025
1312818
Merge pull request #15 from brandonscholten/5-optional-caching
brandonscholten Apr 22, 2025
11e45a7
Create LICENSE
cnewman Apr 24, 2025
041c103
Update README.md
cnewman Apr 24, 2025
23d28e6
Update README.md
cnewman Apr 28, 2025
5a398c5
Update README.md
cnewman Apr 28, 2025
76933c7
Update header level
cnewman Apr 28, 2025
3fce0f6
Resolve merge conflicts
brandonscholten May 3, 2025
cc234d0
Merge branch 'master' into merge-upstream-changes
brandonscholten May 3, 2025
81d140a
Resolve merge conflicts
brandonscholten May 3, 2025
9b7cd2e
Account for context when saving identifiers
brandonscholten May 6, 2025
dc1222c
Save time to tag an identifier in database
brandonscholten May 6, 2025
0ac367b
Removed unused dependencies
brandonscholten May 10, 2025
14dd5c1
Optional GPU accelaration in Docker
brandonscholten May 10, 2025
b4c62d1
Separate GPU accelaration dependencies
brandonscholten May 10, 2025
9eaa576
“Update”
brandonscholten May 11, 2025
47dbb3b
Update serve.json
brandonscholten May 18, 2025
1fb0839
Merge pull request #12 from brandonscholten/merge-upstream-changes
cnewman May 24, 2025
73ac7d4
Add new type of tagger
Jun 2, 2025
347ef4e
Prepare to re-add kfold
Jun 3, 2025
2417e49
Load model when server runs, listen for url
cnewman Jun 3, 2025
4dcd8d4
A half vibe coded mess, but I think it works. Needs a ton of clean up.
cnewman Jun 4, 2025
b30e928
Fix bug with the masking
cnewman Jun 4, 2025
a84f3ad
Remove system as a feature
cnewman Jun 4, 2025
ca22c59
Update to pull from huggingface or local based on --local
cnewman Jun 4, 2025
e083b39
Fix requirements and I dunno how the crf imports are working
Jun 4, 2025
2703566
Remove req that won't work on windows
cnewman Jun 4, 2025
e135cd6
Greatly reduce the requirements.txt to just the top level reqs
Jun 4, 2025
68a1466
Merge branch 'distilbert' of github.com:SCANL/scanl_tagger into disti…
Jun 4, 2025
059eeb0
Make it so that classification report gets printed to a file
cnewman Jun 4, 2025
ecc8855
Update readme
Jun 4, 2025
6ea557f
DRY
Jun 5, 2025
dc5c8a4
Remove reliance on NLTK. Does not reduce effectiveness of the model, and
cnewman Jun 8, 2025
f067186
Add current metrics
cnewman Jun 8, 2025
26857a1
Tested tree and lm based run and train. Did some thorough documenting…
Jun 10, 2025
bde70e2
Update readme with new arguments and data
Jun 10, 2025
89748a0
git workflow
Jun 10, 2025
28dd47c
Starting to see if I can get Doker up again. Update requirements with…
Jun 10, 2025
4ed9cd3
add download_files() to lm execution flow for the way we are currentl…
Jun 10, 2025
cd8d94e
Forgot to run process in the bg for github actions
Jun 10, 2025
44748e4
Merge pull request #13 from SCANL/distilbert
cnewman Jun 10, 2025
12 changes: 6 additions & 6 deletions .github/workflows/tests.yml
@@ -2,9 +2,9 @@ name: SCALAR Tagger CI

 on:
   push:
-    branches: [ main, develop ]
+    branches: [ master, develop, distilbert ]
   pull_request:
-    branches: [ main, develop ]
+    branches: [ master, develop ]

 jobs:
   test-docker:
@@ -78,12 +78,12 @@ jobs:

     - name: Start tagger server
       run: |
-        ./main -r &
+        python main --mode run --model_type lm_based &

         # Wait for up to 5 minutes for the service to start and load models
         timeout=300
         while [ $timeout -gt 0 ]; do
-          if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
+          if curl -s "http://localhost:8080/numberArray/DECLARATION" > /dev/null; then
             echo "Service is ready"
             break
           fi
@@ -101,7 +101,7 @@ jobs:

     - name: Test tagger endpoint
       run: |
-        response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
+        response=$(curl -s "http://localhost:8080/numberArray/DECLARATION")
         if [ -z "$response" ]; then
           echo "No response from tagger"
           exit 1
@@ -112,4 +112,4 @@ jobs:
       uses: actions/cache@v3
       with:
         path: ~/.cache/gensim-data/fasttext-wiki-news-subwords-300*
-        key: ${{ runner.os }}-fasttext-model
+        key: ${{ runner.os }}-fasttext-model
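
The readiness loop in the "Start tagger server" step polls the tagger until it answers or the 300-second budget runs out. The same pattern can be sketched in Python. This is a minimal illustration, not code from the repository: `probe` and `wait_until_ready` are hypothetical names, and the 5-second polling interval is an assumption because the sleep inside the loop is not visible in this hunk. Unlike `curl -s`, `urllib` raises on HTTP 4xx/5xx responses, so this probe is slightly stricter than the shell check.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(check: Callable[[], bool],
                     timeout: float = 300.0,
                     interval: float = 5.0,  # assumed; not shown in the workflow hunk
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it succeeds or roughly `timeout` seconds of sleeping pass."""
    remaining = timeout
    while remaining > 0:
        if check():
            return True
        sleep(interval)
        remaining -= interval
    return False


# Example call (needs a running tagger, so it is left commented out):
# ready = wait_until_ready(
#     lambda: probe("http://localhost:8080/numberArray/DECLARATION"))
```

Passing the check and the sleep function as parameters keeps the loop testable without a live server.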
2 changes: 1 addition & 1 deletion .gitignore
@@ -2,4 +2,4 @@ output/
 __pycache__/
 code2vec/
 cache/
-input.txt
+input.txt
11 changes: 6 additions & 5 deletions Dockerfile
@@ -2,10 +2,11 @@ FROM python:3.12-slim

 # Install (and build) requirements
 COPY requirements.txt /requirements.txt
-RUN apt-get update && \
-    apt-get install -y git curl && \
+RUN apt-get clean && rm -rf /var/lib/apt/lists/* && \
+    apt-get update --fix-missing && \
+    apt-get install --allow-unauthenticated -y git curl && \
     pip install -r requirements.txt && \
-    rm -rf /var/lib/apt/lists/*
+    apt-get clean && rm -rf /var/lib/apt/lists/*

 COPY . .
 RUN pip install -e .
@@ -69,6 +70,6 @@ CMD date; \
     fi; \
     date; \
     echo "Running..."; \
-    /main -r --words words/abbreviationList.csv
+    /main --mode train --model_type lm_based --words words/abbreviationList.csv

-ENV TZ=US/Michigan
+ENV TZ=US/Michigan
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

246 changes: 137 additions & 109 deletions README.md
@@ -1,152 +1,180 @@
# SCALAR Part-of-speech tagger
This the official release of the SCALAR Part-of-speech tagger
# SCALAR Part-of-Speech Tagger for Identifiers

There are two ways to run the tagger. This document describes both ways.
**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:

1. Using Docker compose (which runs the tagger's built-in server for you)
2. Running the tagger's built-in server without Docker
- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate)
- Legacy Gradient Boosting model (for compatibility)

## Getting Started with Docker
---

To run SCNL tagger in a Docker container you can clone the repository and pull the latest docker impage from `srcml/scanl_tagger:latest`
## Installation

Make sure you have Docker and Docker Compose installed:
https://docs.docker.com/engine/install/
https://docs.docker.com/compose/install/
Make sure you have `python3.12` installed. Then:

```
git clone git@github.com:SCANL/scanl_tagger.git
```bash
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
docker compose pull
docker compose up
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Getting Started without Docker
You will need `python3.12` installed.
---

You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/

Set up a virtual environtment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
## Usage

Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
You can run SCALAR in multiple ways:

After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
### CLI (with DistilBERT or GradientBoosting model)

Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
```bash
python main --mode run --model_type lm_based # DistilBERT (recommended)
python main --mode run --model_type tree_based # Legacy model
```

## Usage
Then query like:

```
usage: main [-h] [-v] [-r] [-t] [-a ADDRESS] [--port PORT] [--protocol PROTOCOL]
[--words WORDS]

options:
-h, --help show this help message and exit
-v, --version print tagger application version
-r, --run run server for part of speech tagging requests
-t, --train run training set to retrain the model
-a ADDRESS, --address ADDRESS
configure server address
--port PORT configure server port
--protocol PROTOCOL configure whether the server uses http or https
--words WORDS provide path to a list of acceptable abbreviations
http://127.0.0.1:8080/GetValue/FUNCTION
```

`./main -r` will start the server, which will listen for identifier names sent via HTTP over the route:

http://127.0.0.1:5000/{cache_selection}/{identifier_name}/{code_context}

**NOTE: ** On docker, the port is 8080 instead of 5000.

"cache selection" will save results to a separate cache if it is set to "student"

"code context" is one of:
Supports context types:
- FUNCTION
- ATTRIBUTE
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
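
A request URL for the tagger can be assembled from an identifier and one of the context types listed above. The sketch below assumes the route shape `{base}/{identifier}/{context}` seen in the `http://127.0.0.1:8080/GetValue/FUNCTION` example. `build_tag_url` is a hypothetical helper for illustration, not part of the tagger, and the default base address is an assumption that depends on how the server is started.

```python
from urllib.parse import quote

# Context types listed in the README.
CONTEXTS = {"FUNCTION", "CLASS", "ATTRIBUTE", "DECLARATION", "PARAMETER"}


def build_tag_url(identifier: str, context: str,
                  base: str = "http://127.0.0.1:8080") -> str:
    """Build a tagging URL of the form {base}/{identifier}/{context}."""
    context = context.upper()
    if context not in CONTEXTS:
        raise ValueError(f"unknown context type: {context!r}")
    # Percent-encode the identifier so it stays a single path segment.
    return f"{base}/{quote(identifier, safe='')}/{context}"


if __name__ == "__main__":
    print(build_tag_url("GetValue", "FUNCTION"))
    # prints http://127.0.0.1:8080/GetValue/FUNCTION
```

Rejecting unknown context types on the client side gives a clear error before any request is sent.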

For example:
---

## Training

You can retrain either model (default parameters are currently hardcoded):

```bash
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based
```

---

Tag a declaration: ``http://127.0.0.1:5000/cache/numberArray/DECLARATION``
## Evaluation Results

Tag a function: ``http://127.0.0.1:5000/cache/GetNumberArray/FUNCTION``
### DistilBERT (LM-Based Model) — Recommended

Tag an class: ``http://127.0.0.1:5000/cache/PersonRecord/CLASS``
| Metric | Score |
|--------------------------|---------|
| **Macro F1** | 0.9032 |
| **Token Accuracy** | 0.9223 |
| **Identifier Accuracy** | 0.8291 |

#### Note
Kebab case is not currently supported due to the limitations of Spiral. Attempting to send the tagger identifiers which are in kebab case will result in the entry of a single noun.
| Label | Precision | Recall | F1 | Support |
|-------|-----------|--------|-------|---------|
| CJ | 0.88 | 0.88 | 0.88 | 8 |
| D | 0.98 | 0.96 | 0.97 | 52 |
| DT | 0.95 | 0.93 | 0.94 | 45 |
| N | 0.94 | 0.94 | 0.94 | 418 |
| NM | 0.91 | 0.93 | 0.92 | 440 |
| NPL | 0.97 | 0.97 | 0.97 | 79 |
| P | 0.94 | 0.92 | 0.93 | 79 |
| PRE | 0.79 | 0.79 | 0.79 | 68 |
| V | 0.89 | 0.84 | 0.86 | 110 |
| VM | 0.79 | 0.85 | 0.81 | 13 |

You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
**Inference Performance:**
- Identifiers/sec: 225.8
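
As a rough cross-check, the per-label table can be aggregated by hand. The sketch below assumes "Macro F1" means the unweighted mean of the per-label F1 scores (the usual definition; the README does not state how it was computed). Because the table rounds each F1 to two decimals, the result (about 0.901) only approximates the reported 0.9032. The function names are hypothetical.

```python
# Per-label (F1, support) pairs copied from the DistilBERT table above.
PER_LABEL = {
    "CJ": (0.88, 8), "D": (0.97, 52), "DT": (0.94, 45), "N": (0.94, 418),
    "NM": (0.92, 440), "NPL": (0.97, 79), "P": (0.93, 79), "PRE": (0.79, 68),
    "V": (0.86, 110), "VM": (0.81, 13),
}


def macro_f1(scores: dict) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores: dict) -> float:
    """Support-weighted mean of per-label F1: frequent labels dominate."""
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


if __name__ == "__main__":
    print(f"macro F1    = {macro_f1(PER_LABEL):.3f}")  # prints macro F1    = 0.901
    print(f"weighted F1 = {weighted_f1(PER_LABEL):.3f}")
```

The macro average sits below the weighted one because low-support labels such as PRE and VM pull it down as much as the frequent N and NM labels pull it up.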

---

## Tagset
### Gradient Boost Model (Legacy)

**Supported Tagset**
| Abbreviation | Expanded Form | Examples |
|:------------:|:--------------------------------------------:|:--------------------------------------------:|
| N | noun | Disneyland, shoe, faucet, mother |
| DT | determiner | the, this, that, these, those, which |
| CJ | conjunction | and, for, nor, but, or, yet, so |
| P | preposition | behind, in front of, at, under, above |
| NPL | noun plural | Streets, cities, cars, people, lists |
| NM | noun modifier (**noun-adjunct**, adjective) | red, cold, hot, **bit**Set, **employee**Name |
| V | verb | Run, jump, spin, |
| VM | verb modifier (adverb) | Very, loudly, seriously, impatiently |
| D | digit | 1, 2, 10, 4.12, 0xAF |
| PRE | preamble | Gimp, GLEW, GL, G, p, m, b |
| Metric | Score |
|----------------------|-----------|
| Accuracy | 0.8216 |
| Balanced Accuracy | 0.9160 |
| Weighted Recall | 0.8216 |
| Weighted Precision | 0.8245 |
| Weighted F1 | 0.8220 |
| Inference Time | 249.05s |

**Penn Treebank to SCALAR tagset**
**Inference Performance:**
- Identifiers/sec: 8.6

| Penn Treebank Annotation | SCALAR Tagset |
|:---------------------------:|:------------------------:|
| Conjunction (CC) | Conjunction (CJ) |
| Digit (CD) | Digit (D) |
| Determiner (DT) | Determiner (DT) |
| Foreign Word (FW) | Noun (N) |
| Preposition (IN) | Preposition (P) |
| Adjective (JJ) | Noun Modifier (NM) |
| Comparative Adjective (JJR) | Noun Modifier (NM) |
| Superlative Adjective (JJS) | Noun Modifier (NM) |
| List Item (LS) | Noun (N) |
| Modal (MD) | Verb (V) |
| Noun Singular (NN) | Noun (N) |
| Proper Noun (NNP) | Noun (N) |
| Proper Noun Plural (NNPS) | Noun Plural (NPL) |
| Noun Plural (NNS) | Noun Plural (NPL) |
| Adverb (RB) | Verb Modifier (VM) |
| Comparative Adverb (RBR) | Verb Modifier (VM) |
| Particle (RP) | Verb Modifier (VM) |
| Symbol (SYM) | Noun (N) |
| To Preposition (TO) | Preposition (P) |
| Verb (VB) | Verb (V) |
| Verb (VBD) | Verb (V) |
| Verb (VBG) | Verb (V) |
| Verb (VBN) | Verb (V) |
| Verb (VBP) | Verb (V) |
| Verb (VBZ) | Verb (V) |
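
The mapping table above translates directly into a lookup. The following is a minimal sketch that transcribes the table into a Python dictionary. `PENN_TO_SCALAR` and `to_scalar` are hypothetical names for illustration and do not come from the tagger's code. Penn Treebank tags that the table does not list are deliberately left unmapped and raise a `KeyError`.

```python
# Penn Treebank tag -> SCALAR tag, transcribed from the table above.
PENN_TO_SCALAR = {
    "CC": "CJ", "CD": "D", "DT": "DT", "FW": "N", "IN": "P",
    "JJ": "NM", "JJR": "NM", "JJS": "NM", "LS": "N", "MD": "V",
    "NN": "N", "NNP": "N", "NNPS": "NPL", "NNS": "NPL",
    "RB": "VM", "RBR": "VM", "RP": "VM", "SYM": "N", "TO": "P",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
}


def to_scalar(penn_tags: list) -> list:
    """Convert a sequence of Penn Treebank tags to SCALAR tags."""
    return [PENN_TO_SCALAR[tag] for tag in penn_tags]


if __name__ == "__main__":
    print(to_scalar(["VB", "JJ", "NNS"]))  # prints ['V', 'NM', 'NPL']
```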
---

## Training the tagger
You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
## Supported Tagset

| Tag | Meaning | Examples |
|-------|------------------------------------|--------------------------------|
| N | Noun | `user`, `Data`, `Array` |
| DT | Determiner | `this`, `that`, `those` |
| CJ | Conjunction | `and`, `or`, `but` |
| P | Preposition | `with`, `for`, `in` |
| NPL | Plural Noun | `elements`, `indices` |
| NM | Noun Modifier (adjective-like) | `max`, `total`, `employee` |
| V | Verb | `get`, `set`, `delete` |
| VM | Verb Modifier (adverb-like) | `quickly`, `deeply` |
| D | Digit | `1`, `2`, `10`, `0xAF` |
| PRE | Preamble / Prefix | `m`, `b`, `GL`, `p` |

---

## Docker Support (Legacy only)

For the legacy server, you can also use Docker:

```bash
docker compose pull
docker compose up
```

---

## Notes

- **Kebab case** is not supported (e.g., `do-something-cool`).
- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/) but any AST-based parser works.

---

## Citations

Please cite:

```
@inproceedings{newman2025scalar,
author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
title = {SCALAR: A Part-of-speech Tagger for Identifiers},
booktitle = {ICPC Tool Demonstrations Track},
year = {2025}
}

@article{newman2021ensemble,
title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
journal={IEEE Transactions on Software Engineering},
year={2021},
doi={10.1109/TSE.2021.3098242}
}
```

## Errors?
Please make an issue if you run into errors
---

# Please Cite the Paper(s)!
## Training Data

Newman, Christian, Scholten , Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), “SCALAR: A Part-of-speech Tagger for Identifiers”, in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27 -28, 5 pages TO APPEAR.
You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv)

Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
---

# Training set
The data used to train this tagger can be found in the most recent database update in the repo -- https://github.com/SCANL/scanl_tagger/blob/master/input/scanl_tagger_training_db_11_29_2024.db
## More from SCANL

# Interested in our other work?
Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
- [SCANL Website](https://www.scanl.org/)
- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)

# WordNet
This project uses WordNet to perform a dictionary lookup on the individual words in each identifier:
---

Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010
## Trouble?

Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!
1 change: 1 addition & 0 deletions compose.yml
@@ -20,3 +20,4 @@ services:
       - words:/words
     ports:
       - "${PORT-8080}:5000"
+    restart: always