Commit 44748e4

Merge pull request #13 from SCANL/distilbert: New LM-based tagger

2 parents 1fb0839 + cd8d94e

20 files changed: +3991 −539 lines

.github/workflows/tests.yml

Lines changed: 4 additions & 4 deletions

```diff
@@ -2,7 +2,7 @@ name: SCALAR Tagger CI
 on:
   push:
-    branches: [ master, develop ]
+    branches: [ master, develop, distilbert ]
   pull_request:
     branches: [ master, develop ]
@@ -78,12 +78,12 @@ jobs:
       - name: Start tagger server
         run: |
-          ./main -r &
+          python main --mode run --model_type lm_based &
 
           # Wait for up to 5 minutes for the service to start and load models
           timeout=300
           while [ $timeout -gt 0 ]; do
-            if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
+            if curl -s "http://localhost:8080/numberArray/DECLARATION" > /dev/null; then
              echo "Service is ready"
              break
            fi
@@ -101,7 +101,7 @@ jobs:
       - name: Test tagger endpoint
         run: |
-          response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
+          response=$(curl -s "http://localhost:8080/numberArray/DECLARATION")
           if [ -z "$response" ]; then
             echo "No response from tagger"
             exit 1
```
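The workflow above polls the new route until the model finishes loading, then gives up after five minutes. A minimal Python sketch of the same poll-until-ready pattern (the `wait_for_service` helper, its defaults, and the injectable `prober` parameter are illustrative, not part of the repository):

```python
import time
import urllib.request
import urllib.error

# Route used by the CI health check after this change (no /cache prefix)
TAGGER_URL = "http://localhost:8080/numberArray/DECLARATION"

def probe(url: str) -> bool:
    """Return True if the tagger answers a GET on `url` with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_service(url: str, timeout: int = 300, interval: int = 5,
                     prober=probe) -> bool:
    """Poll `url` every `interval` seconds until it answers or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if prober(url):
            return True
        time.sleep(interval)
    return False
```

Injecting the prober keeps the helper testable without a live server; the shell loop in the workflow is the same logic with `curl -s`.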

Dockerfile

Lines changed: 1 addition & 8 deletions

```diff
@@ -1,18 +1,11 @@
 FROM python:3.12-slim
 
-#argument to enable GPU accelaration
-ARG GPU=false
-
 # Install (and build) requirements
 COPY requirements.txt /requirements.txt
-COPY requirements_gpu.txt /requirements_gpu.txt
 RUN apt-get clean && rm -rf /var/lib/apt/lists/* && \
     apt-get update --fix-missing && \
     apt-get install --allow-unauthenticated -y git curl && \
     pip install -r requirements.txt && \
-    if [ "$GPU" = true ]; then \
-        pip install -r requirements_gpu.txt; \
-    fi && \
     apt-get clean && rm -rf /var/lib/apt/lists/*
 
 COPY . .
@@ -77,6 +70,6 @@ CMD date; \
     fi; \
     date; \
     echo "Running..."; \
-    /main -r --words words/abbreviationList.csv
+    /main --mode train --model_type lm_based --words words/abbreviationList.csv
 
 ENV TZ=US/Michigan
```
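The new entrypoint drops the old `-r`/`-t` flags in favor of `--mode` and `--model_type`. An argparse sketch of the flag surface implied by this commit's command lines (hypothetical; the repository's actual parser may define more options or different defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI implied by `main --mode ... --model_type ... --words ...`."""
    parser = argparse.ArgumentParser(prog="main")
    parser.add_argument("--mode", choices=["run", "train"], required=True,
                        help="start the tagging server or retrain the model")
    parser.add_argument("--model_type", choices=["lm_based", "tree_based"],
                        default="lm_based",
                        help="DistilBERT+CRF (lm_based) or legacy gradient boosting")
    parser.add_argument("--words",
                        help="path to a list of acceptable abbreviations")
    return parser

# Parse the flags used by the Dockerfile's CMD line
args = build_parser().parse_args(
    ["--mode", "train", "--model_type", "lm_based",
     "--words", "words/abbreviationList.csv"])
```

With this shape, omitting `--model_type` falls back to the recommended `lm_based` model.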

README.md

Lines changed: 135 additions & 117 deletions

````diff
@@ -1,162 +1,180 @@
-# SCALAR Part-of-speech tagger
-This the official release of the SCALAR Part-of-speech tagger
+# SCALAR Part-of-Speech Tagger for Identifiers
 
-There are two ways to run the tagger. This document describes both ways.
+**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:
 
-1. Using Docker compose (which runs the tagger's built-in server for you)
-2. Running the tagger's built-in server without Docker
+- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate)
+- Legacy Gradient Boosting model (for compatibility)
 
-## Current Metrics (this will be updated every time we update/change the model!)
-|            | Accuracy | Balanced Accuracy | Weighted Recall | Weighted Precision | Weighted F1 | Performance (seconds) |
-|------------|:--------:|:------------------:|:---------------:|:------------------:|:-----------:|:---------------------:|
-| **SCALAR** | **0.8216** | **0.9160** | **0.8216** | **0.8245** | **0.8220** | **249.05** |
-| Ensemble   | 0.7124 | 0.8311 | 0.7124 | 0.7597 | 0.7235 | 1149.44 |
-| Flair      | 0.6087 | 0.7844 | 0.6087 | 0.7755 | 0.6497 | 807.03 |
+---
 
-## Getting Started with Docker
+## Installation
 
-To run SCALAR in a Docker container you can clone the repository and pull the latest docker impage from `sourceslicer/scalar_tagger:latest`
+Make sure you have `python3.12` installed. Then:
 
-Make sure you have Docker and Docker Compose installed:
+```bash
+git clone https://github.com/SCANL/scanl_tagger.git
+cd scanl_tagger
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+---
+
+## Usage
 
-https://docs.docker.com/engine/install/
+You can run SCALAR in multiple ways:
 
-https://docs.docker.com/compose/install/
+### CLI (with DistilBERT or GradientBoosting model)
 
+```bash
+python main --mode run --model_type lm_based    # DistilBERT (recommended)
+python main --mode run --model_type tree_based  # Legacy model
 ```
-git clone git@github.com:SCANL/scanl_tagger.git
-cd scanl_tagger
-docker compose pull
-docker compose up
+
+Then query like:
+
+```
+http://127.0.0.1:8080/GetValue/FUNCTION
 ```
 
-## Getting Started without Docker
-You will need `python3.12` installed.
+Supports context types:
+- FUNCTION
+- CLASS
+- ATTRIBUTE
+- DECLARATION
+- PARAMETER
+
+---
 
-You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/
+## Training
 
-Set up a virtual environtment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
+You can retrain either model (default parameters are currently hardcoded):
 
-Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
+```bash
+python main --mode train --model_type lm_based
+python main --mode train --model_type tree_based
+```
 
-After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
+---
 
-Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
+## Evaluation Results
 
-## Usage
+### DistilBERT (LM-Based Model) — Recommended
 
-```
-usage: main [-h] [-v] [-r] [-t] [-a ADDRESS] [--port PORT] [--protocol PROTOCOL]
-            [--words WORDS]
-
-options:
-  -h, --help            show this help message and exit
-  -v, --version         print tagger application version
-  -r, --run             run server for part of speech tagging requests
-  -t, --train           run training set to retrain the model
-  -a ADDRESS, --address ADDRESS
-                        configure server address
-  --port PORT           configure server port
-  --protocol PROTOCOL   configure whether the server uses http or https
-  --words WORDS         provide path to a list of acceptable abbreviations
-```
+| Metric                  | Score   |
+|-------------------------|---------|
+| **Macro F1**            | 0.9032  |
+| **Token Accuracy**      | 0.9223  |
+| **Identifier Accuracy** | 0.8291  |
 
-`./main -r` will start the server, which will listen for identifier names sent via HTTP over the route:
+| Label | Precision | Recall | F1    | Support |
+|-------|-----------|--------|-------|---------|
+| CJ    | 0.88      | 0.88   | 0.88  | 8       |
+| D     | 0.98      | 0.96   | 0.97  | 52      |
+| DT    | 0.95      | 0.93   | 0.94  | 45      |
+| N     | 0.94      | 0.94   | 0.94  | 418     |
+| NM    | 0.91      | 0.93   | 0.92  | 440     |
+| NPL   | 0.97      | 0.97   | 0.97  | 79      |
+| P     | 0.94      | 0.92   | 0.93  | 79      |
+| PRE   | 0.79      | 0.79   | 0.79  | 68      |
+| V     | 0.89      | 0.84   | 0.86  | 110     |
+| VM    | 0.79      | 0.85   | 0.81  | 13      |
 
-http://127.0.0.1:8080/{identifier_name}/{code_context}/{database_name (optional)}
+**Inference Performance:**
+- Identifiers/sec: 225.8
 
-"database name" specifies an sqlite database to be used for result caching and data collection. If the database specified does not exist, one will be created.
+---
 
-You can check wehther or not a database exists by using the `/probe` route by sending an HTTP request like this:
+### Gradient Boost Model (Legacy)
 
-http://127.0.0.1:5000/probe/{database_name}
+| Metric             | Score     |
+|--------------------|-----------|
+| Accuracy           | 0.8216    |
+| Balanced Accuracy  | 0.9160    |
+| Weighted Recall    | 0.8216    |
+| Weighted Precision | 0.8245    |
+| Weighted F1        | 0.8220    |
+| Inference Time     | 249.05s   |
 
-"code context" is one of:
-- FUNCTION
-- ATTRIBUTE
-- CLASS
-- DECLARATION
-- PARAMETER
+**Inference Performance:**
+- Identifiers/sec: 8.6
 
-For example:
+---
 
-Tag a declaration: ``http://127.0.0.1:8000/numberArray/DECLARATION/database``
+## Supported Tagset
 
-Tag a function: ``http://127.0.0.1:8000/GetNumberArray/FUNCTION/database``
+| Tag   | Meaning                            | Examples                       |
+|-------|------------------------------------|--------------------------------|
+| N     | Noun                               | `user`, `Data`, `Array`        |
+| DT    | Determiner                         | `this`, `that`, `those`        |
+| CJ    | Conjunction                        | `and`, `or`, `but`             |
+| P     | Preposition                        | `with`, `for`, `in`            |
+| NPL   | Plural Noun                        | `elements`, `indices`          |
+| NM    | Noun Modifier (adjective-like)     | `max`, `total`, `employee`     |
+| V     | Verb                               | `get`, `set`, `delete`         |
+| VM    | Verb Modifier (adverb-like)        | `quickly`, `deeply`            |
+| D     | Digit                              | `1`, `2`, `10`, `0xAF`         |
+| PRE   | Preamble / Prefix                  | `m`, `b`, `GL`, `p`            |
 
-Tag an class: ``http://127.0.0.1:8000/PersonRecord/CLASS/database``
+---
 
-#### Note
-Kebab case is not currently supported due to the limitations of Spiral. Attempting to send the tagger identifiers which are in kebab case will result in the entry of a single noun.
+## Docker Support (Legacy only)
 
-You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
+For the legacy server, you can also use Docker:
 
+```bash
+docker compose pull
+docker compose up
+```
+
+---
 
-## Tagset
+## Notes
 
-**Supported Tagset**
-| Abbreviation | Expanded Form | Examples |
-|:------------:|:--------------------------------------------:|:--------------------------------------------:|
-| N            | noun                                         | Disneyland, shoe, faucet, mother             |
-| DT           | determiner                                   | the, this, that, these, those, which         |
-| CJ           | conjunction                                  | and, for, nor, but, or, yet, so              |
-| P            | preposition                                  | behind, in front of, at, under, above        |
-| NPL          | noun plural                                  | Streets, cities, cars, people, lists         |
-| NM           | noun modifier (**noun-adjunct**, adjective)  | red, cold, hot, **bit**Set, **employee**Name |
-| V            | verb                                         | Run, jump, spin,                             |
-| VM           | verb modifier (adverb)                       | Very, loudly, seriously, impatiently         |
-| D            | digit                                        | 1, 2, 10, 4.12, 0xAF                         |
-| PRE          | preamble                                     | Gimp, GLEW, GL, G, p, m, b                   |
+- **Kebab case** is not supported (e.g., `do-something-cool`).
+- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
+- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
+- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/) but any AST-based parser works.
 
-**Penn Treebank to SCALAR tagset**
+---
 
-| Penn Treebank Annotation    | SCALAR Tagset      |
-|:---------------------------:|:------------------:|
-| Conjunction (CC)            | Conjunction (CJ)   |
-| Digit (CD)                  | Digit (D)          |
-| Determiner (DT)             | Determiner (DT)    |
-| Foreign Word (FW)           | Noun (N)           |
-| Preposition (IN)            | Preposition (P)    |
-| Adjective (JJ)              | Noun Modifier (NM) |
-| Comparative Adjective (JJR) | Noun Modifier (NM) |
-| Superlative Adjective (JJS) | Noun Modifier (NM) |
-| List Item (LS)              | Noun (N)           |
-| Modal (MD)                  | Verb (V)           |
-| Noun Singular (NN)          | Noun (N)           |
-| Proper Noun (NNP)           | Noun (N)           |
-| Proper Noun Plural (NNPS)   | Noun Plural (NPL)  |
-| Noun Plural (NNS)           | Noun Plural (NPL)  |
-| Adverb (RB)                 | Verb Modifier (VM) |
-| Comparative Adverb (RBR)    | Verb Modifier (VM) |
-| Particle (RP)               | Verb Modifier (VM) |
-| Symbol (SYM)                | Noun (N)           |
-| To Preposition (TO)         | Preposition (P)    |
-| Verb (VB)                   | Verb (V)           |
-| Verb (VBD)                  | Verb (V)           |
-| Verb (VBG)                  | Verb (V)           |
-| Verb (VBN)                  | Verb (V)           |
-| Verb (VBP)                  | Verb (V)           |
-| Verb (VBZ)                  | Verb (V)           |
+## Citations
+
+Please cite:
+
+```
+@inproceedings{newman2025scalar,
+  author    = {Christian Newman and Brandon Scholten and Sophia Testa and others},
+  title     = {SCALAR: A Part-of-speech Tagger for Identifiers},
+  booktitle = {ICPC Tool Demonstrations Track},
+  year      = {2025}
+}
+
+@article{newman2021ensemble,
+  title   = {An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
+  author  = {Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
+  journal = {IEEE Transactions on Software Engineering},
+  year    = {2021},
+  doi     = {10.1109/TSE.2021.3098242}
+}
+```
 
-## Training the tagger
-You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
+---
 
-## Errors?
-Please make an issue if you run into errors
+## Training Data
 
-# Please Cite the Paper(s)!
+You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv)
 
-Newman, Christian, Scholten, Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), “SCALAR: A Part-of-speech Tagger for Identifiers”, in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27-28, 5 pages TO APPEAR.
+---
 
-Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
+## More from SCANL
 
-# Training set
-The data used to train this tagger can be found in the most recent database update in the repo -- https://github.com/SCANL/scanl_tagger/blob/master/input/scanl_tagger_training_db_11_29_2024.db
+- [SCANL Website](https://www.scanl.org/)
+- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
 
-# Interested in our other work?
-Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
+---
 
-# WordNet
-This project uses WordNet to perform a dictionary lookup on the individual words in each identifier:
+## Trouble?
 
-Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010
+Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!
````
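The new README documents the route pattern `/{identifier}/{context}` with five supported context types. A small client-side sketch that builds and validates such request URLs (the `tag_request_url` helper is illustrative, not part of SCALAR):

```python
from urllib.parse import quote

# Context types listed in the new README's Usage section
CONTEXTS = {"FUNCTION", "CLASS", "ATTRIBUTE", "DECLARATION", "PARAMETER"}

def tag_request_url(identifier: str, context: str,
                    base: str = "http://127.0.0.1:8080") -> str:
    """Build a tagger request URL of the form {base}/{identifier}/{context}."""
    if context not in CONTEXTS:
        raise ValueError(f"unsupported context type: {context}")
    return f"{base}/{quote(identifier)}/{context}"
```

For example, `tag_request_url("GetValue", "FUNCTION")` produces the URL shown in the Usage section, and an unsupported context such as `LOCAL` is rejected before any request is made.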
