|
1 | | -# SCALAR Part-of-speech tagger |
| 1 | +# SCALAR Part-of-Speech Tagger for Identifiers |
2 | 2 |
|
3 | | -THIS IS AN EXPERIMENTAL VERSION OF SCALAR |
| 3 | +**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types: |
4 | 4 |
|
5 | | -Install requirements via `pip install -r requirements.txt` |
| 5 | +- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate) |
| 6 | +- Legacy Gradient Boosting model (for compatibility) |
6 | 7 |
|
7 | | -Run via `python3 main --mode run --model_type lm_based` |
| 8 | +--- |
8 | 9 |
|
9 | | -You can attempt to train it `python main --mode train --model_type lm_based` -- but I make no guarantees about how easily it will work at this stage |
| 10 | +## Installation |
10 | 11 |
|
11 | | -It still technically supports the old gradientboost model, too... but no guarantees as to how well it functions in this branch. |
| 12 | +Make sure you have `python3.12` installed. Then: |
12 | 13 |
|
13 | | -## Evaluation Results (Held-Out Set) |
| 14 | +```bash |
| 15 | +git clone https://github.com/SCANL/scanl_tagger.git |
| 16 | +cd scanl_tagger |
| 17 | +python -m venv venv |
| 18 | +source venv/bin/activate |
| 19 | +pip install -r requirements.txt |
| 20 | +``` |
14 | 21 |
|
15 | | -### Per-Class Metrics |
| 22 | +--- |
16 | 23 |
|
17 | | -| Label | Precision | Recall | F1-Score | Support | |
18 | | -|-------|-----------|--------|----------|---------| |
19 | | -| CJ | 0.88 | 0.88 | 0.88 | 8 | |
20 | | -| D | 0.98 | 0.96 | 0.97 | 52 | |
21 | | -| DT | 0.95 | 0.93 | 0.94 | 45 | |
22 | | -| N | 0.94 | 0.94 | 0.94 | 418 | |
23 | | -| NM | 0.91 | 0.93 | 0.92 | 440 | |
24 | | -| NPL | 0.97 | 0.97 | 0.97 | 79 | |
25 | | -| P | 0.94 | 0.92 | 0.93 | 79 | |
26 | | -| PRE | 0.79 | 0.79 | 0.79 | 68 | |
27 | | -| V | 0.89 | 0.84 | 0.86 | 110 | |
28 | | -| VM | 0.79 | 0.85 | 0.81 | 13 | |
| 24 | +## Usage |
29 | 25 |
|
30 | | -### Aggregate Metrics |
| 26 | +You can run SCALAR in multiple ways: |
31 | 27 |
|
32 | | -| Metric | Score | |
33 | | -|---------------------|--------| |
34 | | -| Accuracy | 0.92 | |
35 | | -| Macro Avg F1 | 0.90 | |
36 | | -| Weighted Avg F1 | 0.92 | |
37 | | -| Total Examples | 1312 | |
| 28 | +### CLI (DistilBERT or Gradient Boosting model)
38 | 29 |
|
39 | | -### Inference Statistics |
| 30 | +```bash |
| 31 | +python main --mode run --model_type lm_based # DistilBERT (recommended) |
| 32 | +python main --mode run --model_type tree_based # Legacy model |
| 33 | +``` |
40 | 34 |
|
41 | | -- **Inference Time:** 1.74s for 392 identifiers (3746 tokens) |
42 | | -- **Tokens/sec:** 2157.78 |
43 | | -- **Identifiers/sec:** 225.80 |
| 35 | +Once the server is running, query it like:
44 | 36 |
|
45 | | -### Final Scores |
| 37 | +``` |
| 38 | +http://127.0.0.1:8080/GetValue/FUNCTION |
| 39 | +``` |
46 | 40 |
|
47 | | -- **Final Macro F1 on Held-Out Set:** 0.9032 |
48 | | -- **Final Token-level Accuracy:** 0.9223 |
49 | | -- **Final Identifier-level Accuracy:** 0.8291 |
| 41 | +Supported context types:
| 42 | +- FUNCTION |
| 43 | +- CLASS |
| 44 | +- ATTRIBUTE |
| 45 | +- DECLARATION |
| 46 | +- PARAMETER |
| 47 | + |
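As a sketch of programmatic use, the route layout from the example above can be wrapped in a small helper. The helper name and base address are illustrative; only the `/<identifier>/<context>` route shape comes from the example:

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:8080"  # default address from the run command above

def tag_request_url(identifier: str, context: str) -> str:
    """Build a GET URL for a running SCALAR server.

    `context` is one of the context types listed above
    (FUNCTION, CLASS, ATTRIBUTE, DECLARATION, PARAMETER).
    """
    return f"{BASE}/{quote(identifier)}/{context}"

print(tag_request_url("GetValue", "FUNCTION"))
# http://127.0.0.1:8080/GetValue/FUNCTION
```

Fetching that URL (e.g. with `urllib.request.urlopen`) returns the server's tagging for the identifier; the exact response format is not documented here.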
| 48 | +--- |
| 49 | + |
| 50 | +## Training |
| 51 | + |
| 52 | +You can retrain either model (default parameters are currently hardcoded): |
| 53 | + |
| 54 | +```bash |
| 55 | +python main --mode train --model_type lm_based |
| 56 | +python main --mode train --model_type tree_based |
| 57 | +``` |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## Evaluation Results |
| 62 | + |
| 63 | +### DistilBERT (LM-Based Model) — Recommended |
| 64 | + |
| 65 | +| Metric | Score | |
| 66 | +|--------------------------|---------| |
| 67 | +| **Macro F1** | 0.9032 | |
| 68 | +| **Token Accuracy** | 0.9223 | |
| 69 | +| **Identifier Accuracy** | 0.8291 | |
| 70 | + |
| 71 | +| Label | Precision | Recall | F1 | Support | |
| 72 | +|-------|-----------|--------|-------|---------| |
| 73 | +| CJ | 0.88 | 0.88 | 0.88 | 8 | |
| 74 | +| D | 0.98 | 0.96 | 0.97 | 52 | |
| 75 | +| DT | 0.95 | 0.93 | 0.94 | 45 | |
| 76 | +| N | 0.94 | 0.94 | 0.94 | 418 | |
| 77 | +| NM | 0.91 | 0.93 | 0.92 | 440 | |
| 78 | +| NPL | 0.97 | 0.97 | 0.97 | 79 | |
| 79 | +| P | 0.94 | 0.92 | 0.93 | 79 | |
| 80 | +| PRE | 0.79 | 0.79 | 0.79 | 68 | |
| 81 | +| V | 0.89 | 0.84 | 0.86 | 110 | |
| 82 | +| VM | 0.79 | 0.85 | 0.81 | 13 | |
| 83 | + |
| 84 | +**Inference Performance:** |
| 85 | +- Identifiers/sec: 225.8 |
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +### Gradient Boosting Model (Legacy)
| 90 | + |
| 91 | +| Metric | Score | |
| 92 | +|----------------------|-----------| |
| 93 | +| Accuracy | 0.8216 | |
| 94 | +| Balanced Accuracy | 0.9160 | |
| 95 | +| Weighted Recall | 0.8216 | |
| 96 | +| Weighted Precision | 0.8245 | |
| 97 | +| Weighted F1 | 0.8220 | |
| 98 | +| Inference Time | 249.05s | |
| 99 | + |
| 100 | +**Inference Performance:** |
| 101 | +- Identifiers/sec: 8.6 |
| 102 | + |
| 103 | +--- |
| 104 | + |
| 105 | +## Supported Tagset |
| 106 | + |
| 107 | +| Tag | Meaning | Examples | |
| 108 | +|-------|------------------------------------|--------------------------------| |
| 109 | +| N | Noun | `user`, `Data`, `Array` | |
| 110 | +| DT | Determiner | `this`, `that`, `those` | |
| 111 | +| CJ | Conjunction | `and`, `or`, `but` | |
| 112 | +| P | Preposition | `with`, `for`, `in` | |
| 113 | +| NPL | Plural Noun | `elements`, `indices` | |
| 114 | +| NM | Noun Modifier (adjective-like) | `max`, `total`, `employee` | |
| 115 | +| V | Verb | `get`, `set`, `delete` | |
| 116 | +| VM | Verb Modifier (adverb-like) | `quickly`, `deeply` | |
| 117 | +| D | Digit | `1`, `2`, `10`, `0xAF` | |
| 118 | +| PRE | Preamble / Prefix | `m`, `b`, `GL`, `p` | |
| 119 | + |
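To illustrate how the tagset applies to a split identifier, here is one worked example. The tag sequence below is illustrative, not actual SCALAR output:

```python
# Human-readable names for the tags used below (a subset of the table above).
TAGSET = {"V": "Verb", "NM": "Noun Modifier", "NPL": "Plural Noun"}

# Illustrative tagging of one camelCase identifier, split into tokens.
tokens = ["get", "User", "Ids"]
tags = ["V", "NM", "NPL"]  # assumed sequence; not produced by running SCALAR

for token, tag in zip(tokens, tags):
    print(f"{token:>5} -> {tag} ({TAGSET[tag]})")
```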
| 120 | +--- |
| 121 | + |
| 122 | +## Docker Support (Legacy only) |
| 123 | + |
| 124 | +For the legacy server, you can also use Docker: |
| 125 | + |
| 126 | +```bash |
| 127 | +docker compose pull |
| 128 | +docker compose up |
| 129 | +``` |
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## Notes |
| 134 | + |
| 135 | +- **Kebab case** is not supported (e.g., `do-something-cool`). |
| 136 | +- Feature and position tokens (e.g., `@pos_0`) are inserted automatically. |
| 137 | +- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features. |
| 138 | +- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/) but any AST-based parser works. |
| 139 | + |
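Since kebab case is unsupported and input must arrive as identifier tokens, callers typically split names beforehand. A minimal splitter sketch for camelCase and snake_case follows; this is not SCALAR's internal splitter, just one plausible preprocessing step:

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into tokens.

    Minimal sketch only; SCALAR's own splitting may differ.
    """
    parts = []
    # Break on underscores first, then on case/digit boundaries.
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

print(split_identifier("GetValue"))      # ['Get', 'Value']
print(split_identifier("employee_tax"))  # ['employee', 'tax']
```

The lookahead `(?![a-z])` keeps acronym runs such as `HTML` in `HTMLParser` together while still splitting before the next capitalized word.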
| 140 | +--- |
| 141 | + |
| 142 | +## Citations |
| 143 | + |
| 144 | +Please cite: |
| 145 | + |
| 146 | +``` |
| 147 | +@inproceedings{newman2025scalar, |
| 148 | + author = {Christian Newman and Brandon Scholten and Sophia Testa and others}, |
| 149 | + title = {SCALAR: A Part-of-speech Tagger for Identifiers}, |
| 150 | + booktitle = {ICPC Tool Demonstrations Track}, |
| 151 | + year = {2025} |
| 152 | +} |
| 153 | +
|
| 154 | +@article{newman2021ensemble, |
| 155 | + title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags}, |
| 156 | + author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others}, |
| 157 | + journal={IEEE Transactions on Software Engineering}, |
| 158 | + year={2021}, |
| 159 | + doi={10.1109/TSE.2021.3098242} |
| 160 | +} |
| 161 | +``` |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Training Data |
| 166 | + |
| 167 | +You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv).
| 168 | + |
| 169 | +--- |
| 170 | + |
| 171 | +## More from SCANL |
| 172 | + |
| 173 | +- [SCANL Website](https://www.scanl.org/) |
| 174 | +- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue) |
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## Trouble? |
| 179 | + |
| 180 | +Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems! |