Merged
Changes from all commits
22 commits
73ac7d4
Add new type of tagger
Jun 2, 2025
347ef4e
Prepare to re-add kfold
Jun 3, 2025
2417e49
Load model when server runs, listen for url
cnewman Jun 3, 2025
4dcd8d4
A half vibe coded mess, but I think it works. Needs a ton of clean up.
cnewman Jun 4, 2025
b30e928
Fix bug with the masking
cnewman Jun 4, 2025
a84f3ad
Remove system as a feature
cnewman Jun 4, 2025
ca22c59
Update to pull from huggingface or local based on --local
cnewman Jun 4, 2025
e083b39
Fix requirements and I dunno how the crf imports are working
Jun 4, 2025
2703566
Remove req that won't work on windows
cnewman Jun 4, 2025
e135cd6
Greatly reduce the requirements.txt to just the top level reqs
Jun 4, 2025
68a1466
Merge branch 'distilbert' of github.com:SCANL/scanl_tagger into disti…
Jun 4, 2025
059eeb0
Make it so that classification report gets printed to a file
cnewman Jun 4, 2025
ecc8855
Update readme
Jun 4, 2025
6ea557f
DRY
Jun 5, 2025
dc5c8a4
Remove reliance on NLTK. Does not reduce effectiveness of the model, and
cnewman Jun 8, 2025
f067186
Add current metrics
cnewman Jun 8, 2025
26857a1
Tested tree and lm based run and train. Did some thorough documenting…
Jun 10, 2025
bde70e2
Update readme with new arguments and data
Jun 10, 2025
89748a0
git workflow
Jun 10, 2025
28dd47c
Starting to see if I can get Docker up again. Update requirements with…
Jun 10, 2025
4ed9cd3
add download_files() to lm execution flow for the way we are currentl…
Jun 10, 2025
cd8d94e
Forgot to run process in the bg for github actions
Jun 10, 2025
8 changes: 4 additions & 4 deletions .github/workflows/tests.yml
@@ -2,7 +2,7 @@ name: SCALAR Tagger CI

on:
push:
branches: [ master, develop ]
branches: [ master, develop, distilbert ]
pull_request:
branches: [ master, develop ]

@@ -78,12 +78,12 @@ jobs:

- name: Start tagger server
run: |
./main -r &
python main --mode run --model_type lm_based &

# Wait for up to 5 minutes for the service to start and load models
timeout=300
while [ $timeout -gt 0 ]; do
if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
if curl -s "http://localhost:8080/numberArray/DECLARATION" > /dev/null; then
echo "Service is ready"
break
fi
@@ -101,7 +101,7 @@

- name: Test tagger endpoint
run: |
response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
response=$(curl -s "http://localhost:8080/numberArray/DECLARATION")
if [ -z "$response" ]; then
echo "No response from tagger"
exit 1
9 changes: 1 addition & 8 deletions Dockerfile
@@ -1,18 +1,11 @@
FROM python:3.12-slim

# Argument to enable GPU acceleration
ARG GPU=false

# Install (and build) requirements
COPY requirements.txt /requirements.txt
COPY requirements_gpu.txt /requirements_gpu.txt
RUN apt-get clean && rm -rf /var/lib/apt/lists/* && \
apt-get update --fix-missing && \
apt-get install --allow-unauthenticated -y git curl && \
pip install -r requirements.txt && \
if [ "$GPU" = true ]; then \
pip install -r requirements_gpu.txt; \
fi && \
apt-get clean && rm -rf /var/lib/apt/lists/*

COPY . .
@@ -77,6 +70,6 @@ CMD date; \
fi; \
date; \
echo "Running..."; \
/main -r --words words/abbreviationList.csv
/main --mode train --model_type lm_based --words words/abbreviationList.csv

ENV TZ=US/Michigan
252 changes: 135 additions & 117 deletions README.md
@@ -1,162 +1,180 @@
# SCALAR Part-of-speech tagger
This is the official release of the SCALAR Part-of-speech tagger
# SCALAR Part-of-Speech Tagger for Identifiers

There are two ways to run the tagger. This document describes both ways.
**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types:

1. Using Docker compose (which runs the tagger's built-in server for you)
2. Running the tagger's built-in server without Docker
- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate)
- Legacy Gradient Boosting model (for compatibility)

## Current Metrics (this will be updated every time we update/change the model!)
| | Accuracy | Balanced Accuracy | Weighted Recall | Weighted Precision | Weighted F1 | Performance (seconds) |
|------------|:--------:|:------------------:|:---------------:|:------------------:|:-----------:|:---------------------:|
| **SCALAR** | **0.8216** | **0.9160** | **0.8216** | **0.8245** | **0.8220** | **249.05** |
| Ensemble | 0.7124 | 0.8311 | 0.7124 | 0.7597 | 0.7235 | 1149.44 |
| Flair | 0.6087 | 0.7844 | 0.6087 | 0.7755 | 0.6497 | 807.03 |
---

## Getting Started with Docker
## Installation

To run SCALAR in a Docker container, you can clone the repository and pull the latest Docker image from `sourceslicer/scalar_tagger:latest`
Make sure you have `python3.12` installed. Then:

Make sure you have Docker and Docker Compose installed:
```bash
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

---

## Usage

https://docs.docker.com/engine/install/
You can run SCALAR in multiple ways:

https://docs.docker.com/compose/install/
### CLI (with DistilBERT or GradientBoosting model)

```bash
python main --mode run --model_type lm_based # DistilBERT (recommended)
python main --mode run --model_type tree_based # Legacy model
```
git clone [email protected]:SCANL/scanl_tagger.git
cd scanl_tagger
docker compose pull
docker compose up

Then query like:

```
http://127.0.0.1:8080/GetValue/FUNCTION
```

## Getting Started without Docker
You will need `python3.12` installed.
Supports context types:
- FUNCTION
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
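
A minimal sketch of querying the route above from Python, assuming the server was started locally with `python main --mode run` and listens on port 8080 (the response body is simply printed as-is, since its exact format is not specified here):

```python
import urllib.request

# Hypothetical example: tag the identifier "GetValue" used as a function name.
identifier = "GetValue"
context = "FUNCTION"  # one of FUNCTION, CLASS, ATTRIBUTE, DECLARATION, PARAMETER

url = f"http://127.0.0.1:8080/{identifier}/{context}"
with urllib.request.urlopen(url, timeout=30) as response:
    # Print the raw response returned by the tagger.
    print(response.read().decode("utf-8"))
```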

---

You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/
## Training

Set up a virtual environment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
You can retrain either model (default parameters are currently hardcoded):

Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
```bash
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based
```

After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
---

Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
## Evaluation Results

## Usage
### DistilBERT (LM-Based Model) — Recommended

```
usage: main [-h] [-v] [-r] [-t] [-a ADDRESS] [--port PORT] [--protocol PROTOCOL]
[--words WORDS]

options:
-h, --help show this help message and exit
-v, --version print tagger application version
-r, --run run server for part of speech tagging requests
-t, --train run training set to retrain the model
-a ADDRESS, --address ADDRESS
configure server address
--port PORT configure server port
--protocol PROTOCOL configure whether the server uses http or https
--words WORDS provide path to a list of acceptable abbreviations
```
| Metric | Score |
|--------------------------|---------|
| **Macro F1** | 0.9032 |
| **Token Accuracy** | 0.9223 |
| **Identifier Accuracy** | 0.8291 |

`./main -r` will start the server, which will listen for identifier names sent via HTTP over the route:
| Label | Precision | Recall | F1 | Support |
|-------|-----------|--------|-------|---------|
| CJ | 0.88 | 0.88 | 0.88 | 8 |
| D | 0.98 | 0.96 | 0.97 | 52 |
| DT | 0.95 | 0.93 | 0.94 | 45 |
| N | 0.94 | 0.94 | 0.94 | 418 |
| NM | 0.91 | 0.93 | 0.92 | 440 |
| NPL | 0.97 | 0.97 | 0.97 | 79 |
| P | 0.94 | 0.92 | 0.93 | 79 |
| PRE | 0.79 | 0.79 | 0.79 | 68 |
| V | 0.89 | 0.84 | 0.86 | 110 |
| VM | 0.79 | 0.85 | 0.81 | 13 |

http://127.0.0.1:8080/{identifier_name}/{code_context}/{database_name (optional)}
**Inference Performance:**
- Identifiers/sec: 225.8

"database name" specifies an sqlite database to be used for result caching and data collection. If the database specified does not exist, one will be created.
---

You can check whether or not a database exists by using the `/probe` route with an HTTP request like this:
### Gradient Boost Model (Legacy)

http://127.0.0.1:5000/probe/{database_name}
| Metric | Score |
|----------------------|-----------|
| Accuracy | 0.8216 |
| Balanced Accuracy | 0.9160 |
| Weighted Recall | 0.8216 |
| Weighted Precision | 0.8245 |
| Weighted F1 | 0.8220 |
| Inference Time | 249.05s |

"code context" is one of:
- FUNCTION
- ATTRIBUTE
- CLASS
- DECLARATION
- PARAMETER
**Inference Performance:**
- Identifiers/sec: 8.6

For example:
---

Tag a declaration: ``http://127.0.0.1:8000/numberArray/DECLARATION/database``
## Supported Tagset

Tag a function: ``http://127.0.0.1:8000/GetNumberArray/FUNCTION/database``
| Tag | Meaning | Examples |
|-------|------------------------------------|--------------------------------|
| N | Noun | `user`, `Data`, `Array` |
| DT | Determiner | `this`, `that`, `those` |
| CJ | Conjunction | `and`, `or`, `but` |
| P | Preposition | `with`, `for`, `in` |
| NPL | Plural Noun | `elements`, `indices` |
| NM | Noun Modifier (adjective-like) | `max`, `total`, `employee` |
| V | Verb | `get`, `set`, `delete` |
| VM | Verb Modifier (adverb-like) | `quickly`, `deeply` |
| D | Digit | `1`, `2`, `10`, `0xAF` |
| PRE | Preamble / Prefix | `m`, `b`, `GL`, `p` |

Tag a class: ``http://127.0.0.1:8000/PersonRecord/CLASS/database``
---

#### Note
Kebab case is not currently supported due to the limitations of Spiral. Attempting to send the tagger identifiers which are in kebab case will result in the entry of a single noun.
## Docker Support (Legacy only)

You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
For the legacy server, you can also use Docker:

```bash
docker compose pull
docker compose up
```

---

## Tagset
## Notes

**Supported Tagset**
| Abbreviation | Expanded Form | Examples |
|:------------:|:--------------------------------------------:|:--------------------------------------------:|
| N | noun | Disneyland, shoe, faucet, mother |
| DT | determiner | the, this, that, these, those, which |
| CJ | conjunction | and, for, nor, but, or, yet, so |
| P | preposition | behind, in front of, at, under, above |
| NPL | noun plural | Streets, cities, cars, people, lists |
| NM | noun modifier (**noun-adjunct**, adjective) | red, cold, hot, **bit**Set, **employee**Name |
| V | verb | Run, jump, spin, |
| VM | verb modifier (adverb) | Very, loudly, seriously, impatiently |
| D | digit | 1, 2, 10, 4.12, 0xAF |
| PRE | preamble | Gimp, GLEW, GL, G, p, m, b |
- **Kebab case** is not supported (e.g., `do-something-cool`).
- Feature and position tokens (e.g., `@pos_0`) are inserted automatically.
- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features.
- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/), but any AST-based parser works; a hypothetical extraction sketch is shown below.
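
SCALAR does not parse source files itself, so identifiers have to be extracted first. As a purely hypothetical illustration (using Python's standard `ast` module in place of srcML, and assuming a tagger server running locally on port 8080), one could pull function names out of a Python file and tag each one:

```python
import ast
import urllib.request

# Hypothetical sketch: extract function names from a Python source file with
# the standard-library ast module (srcML or any other AST-based parser plays
# the same role for other languages) and tag each name with a local SCALAR server.
with open("example.py") as f:  # assumed input file
    tree = ast.parse(f.read())

for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        url = f"http://127.0.0.1:8080/{node.name}/FUNCTION"
        with urllib.request.urlopen(url, timeout=30) as resp:
            print(node.name, "->", resp.read().decode("utf-8"))
```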

**Penn Treebank to SCALAR tagset**
---

| Penn Treebank Annotation | SCALAR Tagset |
|:---------------------------:|:------------------------:|
| Conjunction (CC) | Conjunction (CJ) |
| Digit (CD) | Digit (D) |
| Determiner (DT) | Determiner (DT) |
| Foreign Word (FW) | Noun (N) |
| Preposition (IN) | Preposition (P) |
| Adjective (JJ) | Noun Modifier (NM) |
| Comparative Adjective (JJR) | Noun Modifier (NM) |
| Superlative Adjective (JJS) | Noun Modifier (NM) |
| List Item (LS) | Noun (N) |
| Modal (MD) | Verb (V) |
| Noun Singular (NN) | Noun (N) |
| Proper Noun (NNP) | Noun (N) |
| Proper Noun Plural (NNPS) | Noun Plural (NPL) |
| Noun Plural (NNS) | Noun Plural (NPL) |
| Adverb (RB) | Verb Modifier (VM) |
| Comparative Adverb (RBR) | Verb Modifier (VM) |
| Particle (RP) | Verb Modifier (VM) |
| Symbol (SYM) | Noun (N) |
| To Preposition (TO) | Preposition (P) |
| Verb (VB) | Verb (V) |
| Verb (VBD) | Verb (V) |
| Verb (VBG) | Verb (V) |
| Verb (VBN) | Verb (V) |
| Verb (VBP) | Verb (V) |
| Verb (VBZ) | Verb (V) |
## Citations

Please cite:

```
@inproceedings{newman2025scalar,
author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
title = {SCALAR: A Part-of-speech Tagger for Identifiers},
booktitle = {ICPC Tool Demonstrations Track},
year = {2025}
}

@article{newman2021ensemble,
title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
journal={IEEE Transactions on Software Engineering},
year={2021},
doi={10.1109/TSE.2021.3098242}
}
```

## Training the tagger
You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
---

## Errors?
Please make an issue if you run into errors
## Training Data

# Please Cite the Paper(s)!
You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv)
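
A minimal sketch for inspecting the dataset before retraining, assuming a repository checkout (the file lives at `input/tagger_data.tsv`; no column names are assumed):

```python
import pandas as pd

# Load the SCALAR training data as a tab-separated file and inspect its
# shape and columns before retraining. Column names are not assumed here.
df = pd.read_csv("input/tagger_data.tsv", sep="\t")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```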

Newman, Christian, Scholten , Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), “SCALAR: A Part-of-speech Tagger for Identifiers”, in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27 -28, 5 pages TO APPEAR.
---

Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.
## More from SCANL

# Training set
The data used to train this tagger can be found in the most recent database update in the repo -- https://github.com/SCANL/scanl_tagger/blob/master/input/scanl_tagger_training_db_11_29_2024.db
- [SCANL Website](https://www.scanl.org/)
- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)

# Interested in our other work?
Find our other research [at our webpage](https://www.scanl.org/) and check out the [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue)
---

# WordNet
This project uses WordNet to perform a dictionary lookup on the individual words in each identifier:
## Trouble?

Princeton University "About WordNet." [WordNet](https://wordnet.princeton.edu/). Princeton University. 2010
Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems!