SCALAR is a part-of-speech tagger for source code identifiers. It supports two model types:
- DistilBERT-based model with CRF layer (Recommended: faster, more accurate)
- Legacy Gradient Boosting model (for compatibility)

Make sure you have Python 3.12 installed. Then:
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

You can run SCALAR in multiple ways:
python main --mode run --model_type lm_based # DistilBERT (recommended)
python main --mode run --model_type tree_based # Legacy model

Then query the running server, for example:
http://127.0.0.1:8080/GetValue/FUNCTION

Supported context types:
- FUNCTION
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
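
The query above can also be issued from a script. The sketch below assumes the URL path has the shape `/<identifier>/<context>`, inferred from the example where `GetValue` looks like the identifier and `FUNCTION` is the context type. `build_query_url` and `tag_identifier` are hypothetical helper names, not part of SCALAR, and the response is returned as raw text because its format is not described here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Host and port are taken from the example URL above.
BASE_URL = "http://127.0.0.1:8080"

# Context types listed in this README.
CONTEXTS = {"FUNCTION", "CLASS", "ATTRIBUTE", "DECLARATION", "PARAMETER"}


def build_query_url(identifier: str, context: str, base_url: str = BASE_URL) -> str:
    """Build a query URL, assuming the route shape <base_url>/<identifier>/<context>."""
    context = context.upper()
    if context not in CONTEXTS:
        raise ValueError(f"unsupported context type: {context!r}")
    # Percent-encode the identifier so unusual characters cannot break the path.
    return f"{base_url}/{quote(identifier, safe='')}/{context}"


def tag_identifier(identifier: str, context: str, timeout: float = 10.0) -> str:
    """Send a GET request to a running server and return the raw response body."""
    with urlopen(build_query_url(identifier, context), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(build_query_url("GetValue", "FUNCTION"))
    # With the server running, you could then call:
    # print(tag_identifier("GetValue", "FUNCTION"))
```

Validating the context type on the client side catches typos before a request is sent.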

You can retrain either model (default parameters are currently hardcoded):
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based

| Metric | Score |
|---|---|
| Macro F1 | 0.9032 |
| Token Accuracy | 0.9223 |
| Identifier Accuracy | 0.8291 |

| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| CJ | 0.88 | 0.88 | 0.88 | 8 |
| D | 0.98 | 0.96 | 0.97 | 52 |
| DT | 0.95 | 0.93 | 0.94 | 45 |
| N | 0.94 | 0.94 | 0.94 | 418 |
| NM | 0.91 | 0.93 | 0.92 | 440 |
| NPL | 0.97 | 0.97 | 0.97 | 79 |
| P | 0.94 | 0.92 | 0.93 | 79 |
| PRE | 0.79 | 0.79 | 0.79 | 68 |
| V | 0.89 | 0.84 | 0.86 | 110 |
| VM | 0.79 | 0.85 | 0.81 | 13 |
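
As a sanity check on how the per-label table relates to the Macro F1 score, the sketch below recomputes macro F1 as the unweighted mean of the per-label F1 values, with a support-weighted mean for contrast. It assumes the reported Macro F1 uses this standard definition. The rounded table values give about 0.901, close to the reported 0.9032; the small gap is presumably due to rounding in the table.

```python
from statistics import mean

# Per-label F1 and support, copied from the table above.
per_label_f1 = {
    "CJ": 0.88, "D": 0.97, "DT": 0.94, "N": 0.94, "NM": 0.92,
    "NPL": 0.97, "P": 0.93, "PRE": 0.79, "V": 0.86, "VM": 0.81,
}
support = {
    "CJ": 8, "D": 52, "DT": 45, "N": 418, "NM": 440,
    "NPL": 79, "P": 79, "PRE": 68, "V": 110, "VM": 13,
}

# Macro F1: every label counts equally, regardless of how often it occurs.
macro_f1 = mean(per_label_f1.values())

# Support-weighted F1: frequent labels such as N and NM dominate the average.
total = sum(support.values())
weighted_f1 = sum(per_label_f1[label] * support[label] for label in per_label_f1) / total

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Because macro F1 treats rare labels such as CJ and VM the same as frequent ones, weaker scores on PRE and VM pull it down more than they would in a weighted average.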

Inference Performance:
- Identifiers/sec: 225.8

| Metric | Score |
|---|---|
| Accuracy | 0.8216 |
| Balanced Accuracy | 0.9160 |
| Weighted Recall | 0.8216 |
| Weighted Precision | 0.8245 |
| Weighted F1 | 0.8220 |
| Inference Time | 249.05s |

Inference Performance:
- Identifiers/sec: 8.6
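
To put the two throughput figures in perspective, the sketch below computes their ratio (about 26.3x) and estimates the wall-clock time to tag a corpus at each rate. The corpus size is a hypothetical value chosen for illustration, and `seconds_to_tag` is not part of SCALAR. The estimate also assumes throughput stays constant, which real hardware and batching may not guarantee.

```python
# Throughput figures reported in this README, in identifiers per second.
FASTER_RATE = 225.8
SLOWER_RATE = 8.6


def seconds_to_tag(n_identifiers: int, identifiers_per_sec: float) -> float:
    """Estimate wall-clock seconds to tag n_identifiers at a constant rate."""
    if identifiers_per_sec <= 0:
        raise ValueError("rate must be positive")
    return n_identifiers / identifiers_per_sec


speedup = FASTER_RATE / SLOWER_RATE

# Hypothetical corpus size, used only for illustration.
corpus_size = 10_000

print(f"speedup: {speedup:.1f}x")
print(f"at {FASTER_RATE}/s: {seconds_to_tag(corpus_size, FASTER_RATE):.0f} s")
print(f"at {SLOWER_RATE}/s: {seconds_to_tag(corpus_size, SLOWER_RATE):.0f} s")
```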

| Tag | Meaning | Examples |
|---|---|---|
| N | Noun | user, Data, Array |
| DT | Determiner | this, that, those |
| CJ | Conjunction | and, or, but |
| P | Preposition | with, for, in |
| NPL | Plural Noun | elements, indices |
| NM | Noun Modifier (adjective-like) | max, total, employee |
| V | Verb | get, set, delete |
| VM | Verb Modifier (adverb-like) | quickly, deeply |
| D | Digit | 1, 2, 10, 0xAF |
| PRE | Preamble / Prefix | m, b, GL, p |
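
When post-processing tagged identifiers, it can help to keep the tag set as a lookup table. The sketch below mirrors the table above. The `(token, tag)` pair format and the `describe` helper are assumptions made for illustration, since the tagger's actual output format is not specified here, and the sample pairs are hand-written from the Examples column rather than real tagger output.

```python
# Tag abbreviations and meanings, copied from the table above.
TAG_MEANINGS = {
    "N": "Noun",
    "DT": "Determiner",
    "CJ": "Conjunction",
    "P": "Preposition",
    "NPL": "Plural Noun",
    "NM": "Noun Modifier",
    "V": "Verb",
    "VM": "Verb Modifier",
    "D": "Digit",
    "PRE": "Preamble",
}


def describe(tagged: list[tuple[str, str]]) -> list[str]:
    """Expand (token, tag) pairs into readable 'token: meaning' strings.

    Raises KeyError if a tag is not in the tag set above.
    """
    return [f"{token}: {TAG_MEANINGS[tag]}" for token, tag in tagged]


# Hand-written pairs for an identifier like getMaxElements (not tagger output).
sample = [("get", "V"), ("max", "NM"), ("elements", "NPL")]
for line in describe(sample):
    print(line)
```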

For the legacy server, you can also use Docker:
docker compose pull
docker compose up

- Kebab case is not supported (e.g., `do-something-cool`).
- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
- Internally uses WordNet for lexical features.
- Input must be parsed into identifier tokens. We recommend srcML but any AST-based parser works.
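
To illustrate what "identifier tokens" means in the notes above, here is a minimal sketch of a splitter for camelCase and snake_case names that rejects kebab case. It is not SCALAR's own splitting logic, only a regex-based approximation, and `split_identifier` is a hypothetical helper name. Extracting identifiers from source files in the first place is still left to a parser such as srcML.

```python
import re

# Matches, in order: an acronym followed by a capitalized word (HTTP in HTTPResponse),
# a lowercase or Capitalized word, a trailing run of capitals, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into its tokens."""
    if "-" in name:
        # Kebab case is listed as unsupported, so fail early instead of guessing.
        raise ValueError(f"kebab-case identifiers are not supported: {name!r}")
    # Underscores match none of the alternatives, so they act as separators.
    return _WORD.findall(name)


if __name__ == "__main__":
    for name in ("getHTTPResponse2", "max_user_count", "m_userName"):
        print(name, "->", split_identifier(name))
```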
Please cite:
@inproceedings{newman2025scalar,
author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
title = {SCALAR: A Part-of-speech Tagger for Identifiers},
booktitle = {ICPC Tool Demonstrations Track},
year = {2025}
}
@article{newman2021ensemble,
title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
journal={IEEE Transactions on Software Engineering},
year={2021},
doi={10.1109/TSE.2021.3098242}
}
You can find the most recent SCALAR training dataset here
Please open an issue if you encounter problems!