Commit bde70e2

Christian Newman committed: Update readme with new arguments and data
1 parent 26857a1

1 file changed: README.md (166 additions, 35 deletions)
@@ -1,49 +1,180 @@

**Removed (previous README):**

# SCALAR Part-of-speech tagger

THIS IS AN EXPERIMENTAL VERSION OF SCALAR

Install requirements via `pip install -r requirements.txt`

Run via `python3 main --mode run --model_type lm_based`

You can attempt to train it with `python main --mode train --model_type lm_based` -- but I make no guarantees about how easily it will work at this stage.

It still technically supports the old gradient boost model, too... but no guarantees as to how well it functions in this branch.

## Evaluation Results (Held-Out Set)

### Per-Class Metrics

| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| CJ    | 0.88      | 0.88   | 0.88     | 8       |
| D     | 0.98      | 0.96   | 0.97     | 52      |
| DT    | 0.95      | 0.93   | 0.94     | 45      |
| N     | 0.94      | 0.94   | 0.94     | 418     |
| NM    | 0.91      | 0.93   | 0.92     | 440     |
| NPL   | 0.97      | 0.97   | 0.97     | 79      |
| P     | 0.94      | 0.92   | 0.93     | 79      |
| PRE   | 0.79      | 0.79   | 0.79     | 68      |
| V     | 0.89      | 0.84   | 0.86     | 110     |
| VM    | 0.79      | 0.85   | 0.81     | 13      |

### Aggregate Metrics

| Metric          | Score |
|-----------------|-------|
| Accuracy        | 0.92  |
| Macro Avg F1    | 0.90  |
| Weighted Avg F1 | 0.92  |
| Total Examples  | 1312  |

### Inference Statistics

- **Inference Time:** 1.74s for 392 identifiers (3746 tokens)
- **Tokens/sec:** 2157.78
- **Identifiers/sec:** 225.80

### Final Scores

- **Final Macro F1 on Held-Out Set:** 0.9032
- **Final Token-level Accuracy:** 0.9223
- **Final Identifier-level Accuracy:** 0.8291

**Added (updated README):**

# SCALAR Part-of-Speech Tagger for Identifiers

**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:

- **DistilBERT-based model with a CRF layer** (recommended: faster and more accurate)
- Legacy Gradient Boosting model (for compatibility)

---

## Installation

Make sure you have `python3.12` installed. Then:

```bash
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

---

## Usage

You can run SCALAR in multiple ways:

### CLI (with the DistilBERT or Gradient Boosting model)

```bash
python main --mode run --model_type lm_based    # DistilBERT (recommended)
python main --mode run --model_type tree_based  # Legacy model
```

Then query the running server like:

```
http://127.0.0.1:8080/GetValue/FUNCTION
```
Supported context types (see the example request below):

- FUNCTION
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
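For example, a minimal Python client sketch. It assumes the `<identifier>/<context>` route implied by the example URL above; the `tag_identifier` helper and the use of the `requests` library are illustrative, not part of SCALAR:

```python
# Minimal sketch of querying a locally running SCALAR server.
# Assumes the route pattern http://127.0.0.1:8080/<identifier>/<context>
# implied by the example URL above; the response is printed verbatim
# because its exact format is not specified here.
import requests

BASE_URL = "http://127.0.0.1:8080"

def tag_identifier(identifier: str, context: str = "FUNCTION") -> str:
    """Ask the tagger for part-of-speech tags for one identifier."""
    resp = requests.get(f"{BASE_URL}/{identifier}/{context}", timeout=10)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    # The README's own example: the identifier GetValue in a FUNCTION context.
    print(tag_identifier("GetValue", "FUNCTION"))
```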
---
## Training

You can retrain either model (default parameters are currently hardcoded):

```bash
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based
```
---
## Evaluation Results

### DistilBERT (LM-Based Model) - Recommended

Aggregate metrics:

| Metric                  | Score  |
|-------------------------|--------|
| **Macro F1**            | 0.9032 |
| **Token Accuracy**      | 0.9223 |
| **Identifier Accuracy** | 0.8291 |

Per-class metrics:

| Label | Precision | Recall | F1   | Support |
|-------|-----------|--------|------|---------|
| CJ    | 0.88      | 0.88   | 0.88 | 8       |
| D     | 0.98      | 0.96   | 0.97 | 52      |
| DT    | 0.95      | 0.93   | 0.94 | 45      |
| N     | 0.94      | 0.94   | 0.94 | 418     |
| NM    | 0.91      | 0.93   | 0.92 | 440     |
| NPL   | 0.97      | 0.97   | 0.97 | 79      |
| P     | 0.94      | 0.92   | 0.93 | 79      |
| PRE   | 0.79      | 0.79   | 0.79 | 68      |
| V     | 0.89      | 0.84   | 0.86 | 110     |
| VM    | 0.79      | 0.85   | 0.81 | 13      |

**Inference Performance:**

- Identifiers/sec: 225.8
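For reference, macro F1 is the unweighted mean of the per-class F1 scores; recomputing it from the rounded table values (a quick sanity-check sketch, not project code) lands near the reported 0.9032:

```python
# Macro F1 is the unweighted mean of the per-class F1 scores.
# Using the rounded values from the table above, the result comes out
# close to the reported 0.9032; the small gap is table rounding.
per_class_f1 = {
    "CJ": 0.88, "D": 0.97, "DT": 0.94, "N": 0.94, "NM": 0.92,
    "NPL": 0.97, "P": 0.93, "PRE": 0.79, "V": 0.86, "VM": 0.81,
}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.4f}")  # 0.9010
```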
---
### Gradient Boost Model (Legacy)

| Metric             | Value   |
|--------------------|---------|
| Accuracy           | 0.8216  |
| Balanced Accuracy  | 0.9160  |
| Weighted Recall    | 0.8216  |
| Weighted Precision | 0.8245  |
| Weighted F1        | 0.8220  |
| Inference Time     | 249.05s |

**Inference Performance:**

- Identifiers/sec: 8.6
---
## Supported Tagset

| Tag | Meaning                        | Examples                    |
|-----|--------------------------------|-----------------------------|
| N   | Noun                           | `user`, `Data`, `Array`     |
| DT  | Determiner                     | `this`, `that`, `those`     |
| CJ  | Conjunction                    | `and`, `or`, `but`          |
| P   | Preposition                    | `with`, `for`, `in`         |
| NPL | Plural Noun                    | `elements`, `indices`       |
| NM  | Noun Modifier (adjective-like) | `max`, `total`, `employee`  |
| V   | Verb                           | `get`, `set`, `delete`      |
| VM  | Verb Modifier (adverb-like)    | `quickly`, `deeply`         |
| D   | Digit                          | `1`, `2`, `10`, `0xAF`      |
| PRE | Preamble / Prefix              | `m`, `b`, `GL`, `p`         |
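As a rough illustration of how the tagset applies to a split identifier, here is a hand-tagged example based on the table above (illustrative only, not actual SCALAR output):

```python
# Hand-tagged illustration of the tagset above, not tool output:
# the identifier getUserCount splits into three tokens, read as
# verb + noun modifier + head noun.
tagged_identifier = [
    ("get", "V"),    # verb: the action the function performs
    ("User", "NM"),  # noun modifier: describes what kind of count
    ("Count", "N"),  # head noun: the thing being returned
]

for token, tag in tagged_identifier:
    print(f"{token}\t{tag}")
```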
---
## Docker Support (Legacy only)

For the legacy server, you can also use Docker:

```bash
docker compose pull
docker compose up
```
---
## Notes

- **Kebab case** is not supported (e.g., `do-something-cool`).
- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
- Identifiers must first be extracted from source code and passed to the tagger as tokens. We recommend [srcML](https://www.srcml.org/), but any AST-based parser works (see the extraction sketch below).
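To make that last note concrete, here is a toy extraction sketch that uses a regular expression as a crude stand-in for an AST-based parser such as srcML; the source snippet and names are invented for the example:

```python
# Crude identifier extraction sketch: a regex stand-in for an AST-based
# parser such as srcML. It only finds names that look like function
# declarations or calls; a real parser would also tell you the context
# (FUNCTION, CLASS, ATTRIBUTE, DECLARATION, PARAMETER).
import re

SOURCE = """
int get_user_count(struct user_list *users);
void RenderFrameBuffer(void);
"""

# A word immediately followed by '(' is treated as a function-like identifier.
FUNC_NAME = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")

for name in FUNC_NAME.findall(SOURCE):
    print(name)  # get_user_count, RenderFrameBuffer
```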
---
## Citations

Please cite:

```
@inproceedings{newman2025scalar,
  author    = {Christian Newman and Brandon Scholten and Sophia Testa and others},
  title     = {SCALAR: A Part-of-speech Tagger for Identifiers},
  booktitle = {ICPC Tool Demonstrations Track},
  year      = {2025}
}

@article{newman2021ensemble,
  author  = {Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
  title   = {An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
  journal = {IEEE Transactions on Software Engineering},
  year    = {2021},
  doi     = {10.1109/TSE.2021.3098242}
}
```
---
## Training Data

You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv).
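If you want a quick look at that file, here is a minimal loading sketch. It assumes only that the file is tab-separated, and it uses the raw-content counterpart of the linked GitHub URL; the column layout is not documented here, so the sketch just prints whatever it finds:

```python
# Load the SCALAR training data straight from GitHub and inspect it.
# Assumes only that the file is tab-separated; column names and content
# are whatever the dataset actually contains.
import pandas as pd

URL = (
    "https://raw.githubusercontent.com/SCANL/scanl_tagger/"
    "master/input/tagger_data.tsv"
)

# If the file turns out to have no header row, pass header=None instead.
df = pd.read_csv(URL, sep="\t")
print(df.shape)    # (rows, columns)
print(df.columns)  # column names as stored in the file
print(df.head())   # first few training examples
```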
---
## More from SCANL

- [SCANL Website](https://www.scanl.org/)
- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
---
## Trouble?

Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!
