Commit b46867a

Merge pull request #14 from SCANL/master
sync master and dev
2 parents: 4980058 + 44748e4

23 files changed: +4793 −576 lines

.github/workflows/tests.yml

Lines changed: 6 additions & 6 deletions
@@ -2,9 +2,9 @@ name: SCALAR Tagger CI
 
 on:
   push:
-    branches: [ main, develop ]
+    branches: [ master, develop, distilbert ]
   pull_request:
-    branches: [ main, develop ]
+    branches: [ master, develop ]
 
 jobs:
   test-docker:
@@ -78,12 +78,12 @@ jobs:
 
       - name: Start tagger server
        run: |
-          ./main -r &
+          python main --mode run --model_type lm_based &
 
          # Wait for up to 5 minutes for the service to start and load models
          timeout=300
          while [ $timeout -gt 0 ]; do
-            if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
+            if curl -s "http://localhost:8080/numberArray/DECLARATION" > /dev/null; then
              echo "Service is ready"
              break
            fi
@@ -101,7 +101,7 @@ jobs:
 
      - name: Test tagger endpoint
        run: |
-          response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
+          response=$(curl -s "http://localhost:8080/numberArray/DECLARATION")
          if [ -z "$response" ]; then
            echo "No response from tagger"
            exit 1
@@ -112,4 +112,4 @@ jobs:
        uses: actions/cache@v3
        with:
          path: ~/.cache/gensim-data/fasttext-wiki-news-subwords-300*
-          key: ${{ runner.os }}-fasttext-model
+          key: ${{ runner.os }}-fasttext-model
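The readiness check in this workflow (poll an endpoint until it responds, give up after a timeout) is a reusable pattern. A minimal sketch as a standalone helper; the `wait_for` function name and its arguments are illustrative, not part of the repository:

```shell
#!/bin/sh
# wait_for CMD TIMEOUT INTERVAL
# Polls CMD every INTERVAL seconds until it succeeds or TIMEOUT seconds
# have elapsed. Returns 0 on success, 1 on timeout.
wait_for() {
  cmd=$1
  timeout=$2
  interval=${3:-5}
  while [ "$timeout" -gt 0 ]; do
    if $cmd > /dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    timeout=$((timeout - interval))
  done
  return 1
}

# Matching the workflow's 5-minute budget (server must already be running):
# wait_for "curl -s http://localhost:8080/numberArray/DECLARATION" 300 5
```

This keeps the CI step readable and makes the timeout and poll interval explicit parameters instead of literals scattered through the loop.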

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@ output/
 __pycache__/
 code2vec/
 cache/
-input.txt
+input.txt

Dockerfile

Lines changed: 6 additions & 5 deletions
@@ -2,10 +2,11 @@ FROM python:3.12-slim
 
 # Install (and build) requirements
 COPY requirements.txt /requirements.txt
-RUN apt-get update && \
-    apt-get install -y git curl && \
+RUN apt-get clean && rm -rf /var/lib/apt/lists/* && \
+    apt-get update --fix-missing && \
+    apt-get install --allow-unauthenticated -y git curl && \
     pip install -r requirements.txt && \
-    rm -rf /var/lib/apt/lists/*
+    apt-get clean && rm -rf /var/lib/apt/lists/*
 
 COPY . .
 RUN pip install -e .
@@ -69,6 +70,6 @@ CMD date; \
    fi; \
    date; \
    echo "Running..."; \
-    /main -r --words words/abbreviationList.csv
+    /main --mode train --model_type lm_based --words words/abbreviationList.csv
 
-ENV TZ=US/Michigan
+ENV TZ=US/Michigan

LICENSE

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 137 additions & 109 deletions
@@ -1,152 +1,180 @@
-# SCALAR Part-of-speech tagger
-This the official release of the SCALAR Part-of-speech tagger
+# SCALAR Part-of-Speech Tagger for Identifiers
 
-There are two ways to run the tagger. This document describes both ways.
+**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:
 
-1. Using Docker compose (which runs the tagger's built-in server for you)
-2. Running the tagger's built-in server without Docker
+- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate)
+- Legacy Gradient Boosting model (for compatibility)
 
-## Getting Started with Docker
+---
 
-To run SCNL tagger in a Docker container you can clone the repository and pull the latest docker impage from `srcml/scanl_tagger:latest`
+## Installation
 
-Make sure you have Docker and Docker Compose installed:
-https://docs.docker.com/engine/install/
-https://docs.docker.com/compose/install/
+Make sure you have `python3.12` installed. Then:
 
-```
-git clone git@github.com:SCANL/scanl_tagger.git
+```bash
+git clone https://github.com/SCANL/scanl_tagger.git
 cd scanl_tagger
-docker compose pull
-docker compose up
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
 ```
 
-## Getting Started without Docker
-You will need `python3.12` installed.
+---
 
-You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/
-
-Set up a virtual environtment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
+## Usage
 
-Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
+You can run SCALAR in multiple ways:
 
-After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
+### CLI (with DistilBERT or GradientBoosting model)
 
-Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
+```bash
+python main --mode run --model_type lm_based   # DistilBERT (recommended)
+python main --mode run --model_type tree_based # Legacy model
+```
 
-## Usage
+Then query like:
 
 ```
-usage: main [-h] [-v] [-r] [-t] [-a ADDRESS] [--port PORT] [--protocol PROTOCOL]
-            [--words WORDS]
-
-options:
-  -h, --help            show this help message and exit
-  -v, --version         print tagger application version
-  -r, --run             run server for part of speech tagging requests
-  -t, --train           run training set to retrain the model
-  -a ADDRESS, --address ADDRESS
-                        configure server address
-  --port PORT           configure server port
-  --protocol PROTOCOL   configure whether the server uses http or https
-  --words WORDS         provide path to a list of acceptable abbreviations
+http://127.0.0.1:8080/GetValue/FUNCTION
 ```
 
-`./main -r` will start the server, which will listen for identifier names sent via HTTP over the route:
-
-http://127.0.0.1:5000/{cache_selection}/{identifier_name}/{code_context}
-
-**NOTE: ** On docker, the port is 8080 instead of 5000.
-
-"cache selection" will save results to a separate cache if it is set to "student"
-
-"code context" is one of:
+Supports context types:
 - FUNCTION
-- ATTRIBUTE
 - CLASS
+- ATTRIBUTE
 - DECLARATION
 - PARAMETER
 
-For example:
+---
+
+## Training
+
+You can retrain either model (default parameters are currently hardcoded):
+
+```bash
+python main --mode train --model_type lm_based
+python main --mode train --model_type tree_based
+```
+
+---
 
-Tag a declaration: ``http://127.0.0.1:5000/cache/numberArray/DECLARATION``
+## Evaluation Results
 
-Tag a function: ``http://127.0.0.1:5000/cache/GetNumberArray/FUNCTION``
+### DistilBERT (LM-Based Model) — Recommended
 
-Tag an class: ``http://127.0.0.1:5000/cache/PersonRecord/CLASS``
+| Metric                  | Score  |
+|-------------------------|--------|
+| **Macro F1**            | 0.9032 |
+| **Token Accuracy**      | 0.9223 |
+| **Identifier Accuracy** | 0.8291 |
 
-#### Note
-Kebab case is not currently supported due to the limitations of Spiral. Attempting to send the tagger identifiers which are in kebab case will result in the entry of a single noun.
+| Label | Precision | Recall | F1   | Support |
+|-------|-----------|--------|------|---------|
+| CJ    | 0.88      | 0.88   | 0.88 | 8       |
+| D     | 0.98      | 0.96   | 0.97 | 52      |
+| DT    | 0.95      | 0.93   | 0.94 | 45      |
+| N     | 0.94      | 0.94   | 0.94 | 418     |
+| NM    | 0.91      | 0.93   | 0.92 | 440     |
+| NPL   | 0.97      | 0.97   | 0.97 | 79      |
+| P     | 0.94      | 0.92   | 0.93 | 79      |
+| PRE   | 0.79      | 0.79   | 0.79 | 68      |
+| V     | 0.89      | 0.84   | 0.86 | 110     |
+| VM    | 0.79      | 0.85   | 0.81 | 13      |
 
-You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
+**Inference Performance:**
+- Identifiers/sec: 225.8
 
+---
 
-## Tagset
+### Gradient Boost Model (Legacy)
 
-**Supported Tagset**
-| Abbreviation | Expanded Form | Examples |
-|:------------:|:--------------------------------------------:|:--------------------------------------------:|
-| N | noun | Disneyland, shoe, faucet, mother |
-| DT | determiner | the, this, that, these, those, which |
-| CJ | conjunction | and, for, nor, but, or, yet, so |
-| P | preposition | behind, in front of, at, under, above |
-| NPL | noun plural | Streets, cities, cars, people, lists |
-| NM | noun modifier (**noun-adjunct**, adjective) | red, cold, hot, **bit**Set, **employee**Name |
-| V | verb | Run, jump, spin, |
-| VM | verb modifier (adverb) | Very, loudly, seriously, impatiently |
-| D | digit | 1, 2, 10, 4.12, 0xAF |
-| PRE | preamble | Gimp, GLEW, GL, G, p, m, b |
+| Metric             | Score   |
+|--------------------|---------|
+| Accuracy           | 0.8216  |
+| Balanced Accuracy  | 0.9160  |
+| Weighted Recall    | 0.8216  |
+| Weighted Precision | 0.8245  |
+| Weighted F1        | 0.8220  |
+| Inference Time     | 249.05s |
 
-**Penn Treebank to SCALAR tagset**
+**Inference Performance:**
+- Identifiers/sec: 8.6
 
-| Penn Treebank Annotation | SCALAR Tagset |
-|:---------------------------:|:------------------------:|
-| Conjunction (CC) | Conjunction (CJ) |
-| Digit (CD) | Digit (D) |
-| Determiner (DT) | Determiner (DT) |
-| Foreign Word (FW) | Noun (N) |
-| Preposition (IN) | Preposition (P) |
-| Adjective (JJ) | Noun Modifier (NM) |
-| Comparative Adjective (JJR) | Noun Modifier (NM) |
-| Superlative Adjective (JJS) | Noun Modifier (NM) |
-| List Item (LS) | Noun (N) |
-| Modal (MD) | Verb (V) |
-| Noun Singular (NN) | Noun (N) |
-| Proper Noun (NNP) | Noun (N) |
-| Proper Noun Plural (NNPS) | Noun Plural (NPL) |
-| Noun Plural (NNS) | Noun Plural (NPL) |
-| Adverb (RB) | Verb Modifier (VM) |
-| Comparative Adverb (RBR) | Verb Modifier (VM) |
-| Particle (RP) | Verb Modifier (VM) |
-| Symbol (SYM) | Noun (N) |
-| To Preposition (TO) | Preposition (P) |
-| Verb (VB) | Verb (V) |
-| Verb (VBD) | Verb (V) |
-| Verb (VBG) | Verb (V) |
-| Verb (VBN) | Verb (V) |
-| Verb (VBP) | Verb (V) |
-| Verb (VBZ) | Verb (V) |
+---
 
-## Training the tagger
-You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
+## Supported Tagset
+
+| Tag   | Meaning                            | Examples                       |
+|-------|------------------------------------|--------------------------------|
+| N     | Noun                               | `user`, `Data`, `Array`        |
+| DT    | Determiner                         | `this`, `that`, `those`        |
+| CJ    | Conjunction                        | `and`, `or`, `but`             |
+| P     | Preposition                        | `with`, `for`, `in`            |
+| NPL   | Plural Noun                        | `elements`, `indices`          |
+| NM    | Noun Modifier (adjective-like)     | `max`, `total`, `employee`     |
+| V     | Verb                               | `get`, `set`, `delete`         |
+| VM    | Verb Modifier (adverb-like)        | `quickly`, `deeply`            |
+| D     | Digit                              | `1`, `2`, `10`, `0xAF`         |
+| PRE   | Preamble / Prefix                  | `m`, `b`, `GL`, `p`            |
+
+---
+
+## Docker Support (Legacy only)
+
+For the legacy server, you can also use Docker:
+
+```bash
+docker compose pull
+docker compose up
+```
+
+---
+
+## Notes
+
+- **Kebab case** is not supported (e.g., `do-something-cool`).
+- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
+- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
+- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/) but any AST-based parser works.
+
+---
+
+## Citations
+
+Please cite:
+
+```
+@inproceedings{newman2025scalar,
+  author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
+  title = {SCALAR: A Part-of-speech Tagger for Identifiers},
+  booktitle = {ICPC Tool Demonstrations Track},
+  year = {2025}
+}
+
+@article{newman2021ensemble,
+  title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
+  author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
+  journal={IEEE Transactions on Software Engineering},
+  year={2021},
+  doi={10.1109/TSE.2021.3098242}
+}
+```
 
-## Errors?
-Please make an issue if you run into errors
+---
 
-# Please Cite the Paper(s)!
+## Training Data
 
-Newman, Christian, Scholten, Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), "SCALAR: A Part-of-speech Tagger for Identifiers", in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27-28, 5 pages TO APPEAR.
+You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv)
 
-Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
+---
 
-# Training set
-The data used to train this tagger can be found in the most recent database update in the repo -- https://github.com/SCANL/scanl_tagger/blob/master/input/scanl_tagger_training_db_11_29_2024.db
+## More from SCANL
 
-# Interested in our other work?
-Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
+- [SCANL Website](https://www.scanl.org/)
+- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
 
-# WordNet
-This project uses WordNet to perform a dictionary lookup on the individual words in each identifier:
+---
 
-Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010
+## Trouble?
 
+Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!
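The updated README changes the query route from `/{cache_selection}/{identifier_name}/{code_context}` on port 5000 to `/{identifier_name}/{code_context}` on port 8080. A small sketch of building and issuing such a query; the `scalar_url` helper name is illustrative, not part of the repository:

```shell
#!/bin/sh
# Build a query URL for a running SCALAR server, following the route in
# the updated README: {base}/{identifier}/{context}.
scalar_url() {
  # $1 = identifier name, $2 = context type (FUNCTION, CLASS, ATTRIBUTE,
  # DECLARATION, PARAMETER), $3 = optional base URL
  base=${3:-http://127.0.0.1:8080}
  printf '%s/%s/%s\n' "$base" "$1" "$2"
}

# Example, against a server started with:
#   python main --mode run --model_type lm_based
# curl -s "$(scalar_url GetNumberArray FUNCTION)"
```

Keeping URL construction in one place makes it easy to retarget the base address, e.g. when the server runs in Docker on a different port.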

compose.yml

Lines changed: 1 addition & 0 deletions
@@ -20,3 +20,4 @@ services:
       - words:/words
     ports:
       - "${PORT-8080}:5000"
+    restart: always
