SCALAR Part-of-Speech Tagger for Identifiers

SCALAR is a part-of-speech tagger for source code identifiers. It supports two model types:

DistilBERT-based model with CRF layer (Recommended: faster, more accurate)
Legacy Gradient Boosting model (for compatibility)

Installation

Make sure you have python3.12 installed. Then:

git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

You can run SCALAR in multiple ways:

CLI (with DistilBERT or GradientBoosting model)

python main --mode run --model_type lm_based         # DistilBERT (recommended)
python main --mode run --model_type tree_based       # Legacy model

Then query like:

http://127.0.0.1:8080/GetValue/FUNCTION

Supports context types:

FUNCTION
CLASS
ATTRIBUTE
DECLARATION
PARAMETER

Training

You can retrain either model (default parameters are currently hardcoded):

python main --mode train --model_type lm_based
python main --mode train --model_type tree_based

Evaluation Results

DistilBERT (LM-Based Model) — Recommended

Metric	Score
Macro F1	0.9032
Token Accuracy	0.9223
Identifier Accuracy	0.8291

Label	Precision	Recall	F1	Support
CJ	0.88	0.88	0.88	8
D	0.98	0.96	0.97	52
DT	0.95	0.93	0.94	45
N	0.94	0.94	0.94	418
NM	0.91	0.93	0.92	440
NPL	0.97	0.97	0.97	79
P	0.94	0.92	0.93	79
PRE	0.79	0.79	0.79	68
V	0.89	0.84	0.86	110
VM	0.79	0.85	0.81	13

Inference Performance:

Identifiers/sec: 225.8

Gradient Boost Model (Legacy)

Metric	Score
Accuracy	0.8216
Balanced Accuracy	0.9160
Weighted Recall	0.8216
Weighted Precision	0.8245
Weighted F1	0.8220
Inference Time	249.05s

Inference Performance:

Identifiers/sec: 8.6

Supported Tagset

Tag	Meaning	Examples
N	Noun	`user`, `Data`, `Array`
DT	Determiner	`this`, `that`, `those`
CJ	Conjunction	`and`, `or`, `but`
P	Preposition	`with`, `for`, `in`
NPL	Plural Noun	`elements`, `indices`
NM	Noun Modifier (adjective-like)	`max`, `total`, `employee`
V	Verb	`get`, `set`, `delete`
VM	Verb Modifier (adverb-like)	`quickly`, `deeply`
D	Digit	`1`, `2`, `10`, `0xAF`
PRE	Preamble / Prefix	`m`, `b`, `GL`, `p`

Docker Support (Legacy only)

For the legacy server, you can also use Docker:

docker compose pull
docker compose up

Notes

Kebab case is not supported (e.g., do-something-cool).
Feature and position tokens (e.g., @pos_0) are inserted automatically.
Internally uses WordNet for lexical features.
Input must be parsed into identifier tokens. We recommend srcML but any AST-based parser works.

Citations

Please cite:

@inproceedings{newman2025scalar,
  author    = {Christian Newman and Brandon Scholten and Sophia Testa and others},
  title     = {SCALAR: A Part-of-speech Tagger for Identifiers},
  booktitle = {ICPC Tool Demonstrations Track},
  year      = {2025}
}

@article{newman2021ensemble,
  title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
  author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
  journal={IEEE Transactions on Software Engineering},
  year={2021},
  doi={10.1109/TSE.2021.3098242}
}

Name		Name	Last commit message	Last commit date
Latest commit History 237 Commits
.github/workflows		.github/workflows
input		input
models		models
scripts		scripts
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
compose.yml		compose.yml
main		main
requirements.txt		requirements.txt
serve.json		serve.json
setup.py		setup.py
version.py		version.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SCALAR Part-of-Speech Tagger for Identifiers

Installation

Usage

CLI (with DistilBERT or GradientBoosting model)

Training

Evaluation Results

DistilBERT (LM-Based Model) — Recommended

Gradient Boost Model (Legacy)

Supported Tagset

Docker Support (Legacy only)

Notes

Citations

Training Data

More from SCANL

Trouble?

About

Uh oh!

Releases

Packages

Contributors 5

Uh oh!

Languages

License

SCANL/scanl_tagger

Folders and files

Latest commit

History

Repository files navigation

SCALAR Part-of-Speech Tagger for Identifiers

Installation

Usage

CLI (with DistilBERT or GradientBoosting model)

Training

Evaluation Results

DistilBERT (LM-Based Model) — Recommended

Gradient Boost Model (Legacy)

Supported Tagset

Docker Support (Legacy only)

Notes

Citations

Training Data

More from SCANL

Trouble?

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Uh oh!

Languages

Packages