|
1 | | -# SCALAR Part-of-speech tagger |
| 1 | +# SCALAR Part-of-Speech Tagger for Identifiers |
2 | 2 |
|
3 | | -THIS IS AN EXPERIMENTAL VERSION OF SCALAR |
| 3 | +**SCALAR** is a part-of-speech tagger for source code identifiers. It supports two model types: |
4 | 4 |
|
5 | | -Install requirements via `pip install -r requirements.txt` |
| 5 | +- **DistilBERT-based model with CRF layer** (Recommended: faster, more accurate) |
| 6 | +- Legacy Gradient Boosting model (for compatibility) |
6 | 7 |
|
7 | | -Run via `python3 main --mode run --model_type lm_based` |
| 8 | +--- |
8 | 9 |
|
9 | | -You can attempt to train it `python main --mode train --model_type lm_based` -- but I make no guarantees about how easily it will work at this stage |
| 10 | +## Installation |
10 | 11 |
|
11 | | -It still technically supports the old gradientboost model, too... but no guarantees as to how well it functions in this branch. |
| 12 | +Make sure you have `python3.12` installed. Then: |
12 | 13 |
|
13 | | -## Evaluation Results (Held-Out Set) |
| 14 | +```bash |
| 15 | +git clone https://github.com/SCANL/scanl_tagger.git |
| 16 | +cd scanl_tagger |
| 17 | +python -m venv venv |
| 18 | +source venv/bin/activate |
| 19 | +pip install -r requirements.txt |
| 20 | +``` |
14 | 21 |
|
15 | | -### Per-Class Metrics |
| 22 | +--- |
16 | 23 |
|
17 | | -| Label | Precision | Recall | F1-Score | Support | |
18 | | -|-------|-----------|--------|----------|---------| |
19 | | -| CJ | 0.88 | 0.88 | 0.88 | 8 | |
20 | | -| D | 0.98 | 0.96 | 0.97 | 52 | |
21 | | -| DT | 0.95 | 0.93 | 0.94 | 45 | |
22 | | -| N | 0.94 | 0.94 | 0.94 | 418 | |
23 | | -| NM | 0.91 | 0.93 | 0.92 | 440 | |
24 | | -| NPL | 0.97 | 0.97 | 0.97 | 79 | |
25 | | -| P | 0.94 | 0.92 | 0.93 | 79 | |
26 | | -| PRE | 0.79 | 0.79 | 0.79 | 68 | |
27 | | -| V | 0.89 | 0.84 | 0.86 | 110 | |
28 | | -| VM | 0.79 | 0.85 | 0.81 | 13 | |
| 24 | +## Usage |
29 | 25 |
|
30 | | -### Aggregate Metrics |
| 26 | +You can run SCALAR in multiple ways: |
31 | 27 |
|
32 | | -| Metric | Score | |
33 | | -|---------------------|--------| |
34 | | -| Accuracy | 0.92 | |
35 | | -| Macro Avg F1 | 0.90 | |
36 | | -| Weighted Avg F1 | 0.92 | |
37 | | -| Total Examples | 1312 | |
| 28 | +### CLI (DistilBERT or Gradient Boosting model)
38 | 29 |
|
39 | | -### Inference Statistics |
| 30 | +```bash |
| 31 | +python main --mode run --model_type lm_based # DistilBERT (recommended) |
| 32 | +python main --mode run --model_type tree_based # Legacy model |
| 33 | +``` |
40 | 34 |
|
41 | | -- **Inference Time:** 1.74s for 392 identifiers (3746 tokens) |
42 | | -- **Tokens/sec:** 2157.78 |
43 | | -- **Identifiers/sec:** 225.80 |
| 35 | +Once the server is running, query it like:
44 | 36 |
|
45 | | -### Final Scores |
| 37 | +``` |
| 38 | +http://127.0.0.1:8080/GetValue/FUNCTION |
| 39 | +``` |
46 | 40 |
|
47 | | -- **Final Macro F1 on Held-Out Set:** 0.9032 |
48 | | -- **Final Token-level Accuracy:** 0.9223 |
49 | | -- **Final Identifier-level Accuracy:** 0.8291 |
| 41 | +Supported context types:
| 42 | +- FUNCTION |
| 43 | +- CLASS |
| 44 | +- ATTRIBUTE |
| 45 | +- DECLARATION |
| 46 | +- PARAMETER |
| 47 | + |
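As a sketch of programmatic use, the route layout from the example above can be wrapped in a small helper. The helper name and base address are illustrative; only the `/<identifier>/<context>` route shape comes from the example:

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:8080"  # default address from the run command above

def tag_request_url(identifier: str, context: str) -> str:
    """Build a GET URL for a running SCALAR server.

    `context` is one of the context types listed above
    (FUNCTION, CLASS, ATTRIBUTE, DECLARATION, PARAMETER).
    """
    return f"{BASE}/{quote(identifier)}/{context}"

print(tag_request_url("GetValue", "FUNCTION"))
# http://127.0.0.1:8080/GetValue/FUNCTION
```

Fetching that URL (e.g. with `urllib.request.urlopen`) returns the server's tagging for the identifier; the exact response format is not documented here.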
| 48 | +--- |
| 49 | + |
| 50 | +## Training |
| 51 | + |
| 52 | +You can retrain either model (default parameters are currently hardcoded): |
| 53 | + |
| 54 | +```bash |
| 55 | +python main --mode train --model_type lm_based |
| 56 | +python main --mode train --model_type tree_based |
| 57 | +``` |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## Evaluation Results |
| 62 | + |
| 63 | +### DistilBERT (LM-Based Model) — Recommended |
| 64 | + |
| 65 | +| Metric | Score | |
| 66 | +|--------------------------|---------| |
| 67 | +| **Macro F1** | 0.9032 | |
| 68 | +| **Token Accuracy** | 0.9223 | |
| 69 | +| **Identifier Accuracy** | 0.8291 | |
| 70 | + |
| 71 | +| Label | Precision | Recall | F1 | Support | |
| 72 | +|-------|-----------|--------|-------|---------| |
| 73 | +| CJ | 0.88 | 0.88 | 0.88 | 8 | |
| 74 | +| D | 0.98 | 0.96 | 0.97 | 52 | |
| 75 | +| DT | 0.95 | 0.93 | 0.94 | 45 | |
| 76 | +| N | 0.94 | 0.94 | 0.94 | 418 | |
| 77 | +| NM | 0.91 | 0.93 | 0.92 | 440 | |
| 78 | +| NPL | 0.97 | 0.97 | 0.97 | 79 | |
| 79 | +| P | 0.94 | 0.92 | 0.93 | 79 | |
| 80 | +| PRE | 0.79 | 0.79 | 0.79 | 68 | |
| 81 | +| V | 0.89 | 0.84 | 0.86 | 110 | |
| 82 | +| VM | 0.79 | 0.85 | 0.81 | 13 | |
| 83 | + |
| 84 | +**Inference Performance:** |
| 85 | +- Identifiers/sec: 225.8 |
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +### Gradient Boosting Model (Legacy)
| 90 | + |
| 91 | +| Metric | Score | |
| 92 | +|----------------------|-----------| |
| 93 | +| Accuracy | 0.8216 | |
| 94 | +| Balanced Accuracy | 0.9160 | |
| 95 | +| Weighted Recall | 0.8216 | |
| 96 | +| Weighted Precision | 0.8245 | |
| 97 | +| Weighted F1 | 0.8220 | |
| 98 | +| Inference Time | 249.05s | |
| 99 | + |
| 100 | +**Inference Performance:** |
| 101 | +- Identifiers/sec: 8.6 |
| 102 | + |
| 103 | +--- |
| 104 | + |
| 105 | +## Supported Tagset |
| 106 | + |
| 107 | +| Tag | Meaning | Examples | |
| 108 | +|-------|------------------------------------|--------------------------------| |
| 109 | +| N | Noun | `user`, `Data`, `Array` | |
| 110 | +| DT | Determiner | `this`, `that`, `those` | |
| 111 | +| CJ | Conjunction | `and`, `or`, `but` | |
| 112 | +| P | Preposition | `with`, `for`, `in` | |
| 113 | +| NPL | Plural Noun | `elements`, `indices` | |
| 114 | +| NM | Noun Modifier (adjective-like) | `max`, `total`, `employee` | |
| 115 | +| V | Verb | `get`, `set`, `delete` | |
| 116 | +| VM | Verb Modifier (adverb-like) | `quickly`, `deeply` | |
| 117 | +| D | Digit | `1`, `2`, `10`, `0xAF` | |
| 118 | +| PRE | Preamble / Prefix | `m`, `b`, `GL`, `p` | |
| 119 | + |
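To illustrate how the tagset applies to a split identifier, here is one worked example. The tag sequence below is illustrative, not actual SCALAR output:

```python
# Human-readable names for the tags used below (a subset of the table above).
TAGSET = {"V": "Verb", "NM": "Noun Modifier", "NPL": "Plural Noun"}

# Illustrative tagging of one camelCase identifier, split into tokens.
tokens = ["get", "User", "Ids"]
tags = ["V", "NM", "NPL"]  # assumed sequence; not produced by running SCALAR

for token, tag in zip(tokens, tags):
    print(f"{token:>5} -> {tag} ({TAGSET[tag]})")
```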
| 120 | +--- |
| 121 | + |
| 122 | +## Docker Support (Legacy only) |
| 123 | + |
| 124 | +For the legacy server, you can also use Docker: |
| 125 | + |
| 126 | +```bash |
| 127 | +docker compose pull |
| 128 | +docker compose up |
| 129 | +``` |
| 130 | + |
| 131 | +--- |
| 132 | + |
| 133 | +## Notes |
| 134 | + |
| 135 | +- **Kebab case** is not supported (e.g., `do-something-cool`). |
| 136 | +- Feature and position tokens (e.g., `@pos_0`) are inserted automatically. |
| 137 | +- Internally uses [WordNet](https://wordnet.princeton.edu/) for lexical features. |
| 138 | +- Input must be parsed into identifier tokens. We recommend [srcML](https://www.srcml.org/) but any AST-based parser works. |
| 139 | + |
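Since kebab case is unsupported and input must arrive as identifier tokens, callers typically split names beforehand. A minimal splitter sketch for camelCase and snake_case follows; this is not SCALAR's internal splitter, just one plausible preprocessing step:

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into tokens.

    Minimal sketch only; SCALAR's own splitting may differ.
    """
    parts = []
    # Break on underscores first, then on case/digit boundaries.
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

print(split_identifier("GetValue"))      # ['Get', 'Value']
print(split_identifier("employee_tax"))  # ['employee', 'tax']
```

The lookahead `(?![a-z])` keeps acronym runs such as `HTML` in `HTMLParser` together while still splitting before the next capitalized word.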
| 140 | +--- |
| 141 | + |
| 142 | +## Citations |
| 143 | + |
| 144 | +Please cite: |
| 145 | + |
| 146 | +``` |
| 147 | +@inproceedings{newman2025scalar, |
| 148 | + author = {Christian Newman and Brandon Scholten and Sophia Testa and others}, |
| 149 | + title = {SCALAR: A Part-of-speech Tagger for Identifiers}, |
| 150 | + booktitle = {ICPC Tool Demonstrations Track}, |
| 151 | + year = {2025} |
| 152 | +} |
| 153 | +
|
| 154 | +@article{newman2021ensemble, |
| 155 | + title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags}, |
| 156 | + author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others}, |
| 157 | + journal={IEEE Transactions on Software Engineering}, |
| 158 | + year={2021}, |
| 159 | + doi={10.1109/TSE.2021.3098242} |
| 160 | +} |
| 161 | +``` |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Training Data |
| 166 | + |
| 167 | +You can find the most recent SCALAR training dataset [here](https://github.com/SCANL/scanl_tagger/blob/master/input/tagger_data.tsv).
| 168 | + |
| 169 | +--- |
| 170 | + |
| 171 | +## More from SCANL |
| 172 | + |
| 173 | +- [SCANL Website](https://www.scanl.org/) |
| 174 | +- [Identifier Name Structure Catalogue](https://github.com/SCANL/identifier_name_structure_catalogue) |
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## Trouble? |
| 179 | + |
| 180 | +Please [open an issue](https://github.com/SCANL/scanl_tagger/issues) if you encounter problems! |