Commit ee69358

Merge pull request #10 from SCANL/develop
Improve development infrastructure to help with deploying in the future
2 parents 087c498 + 4980058 commit ee69358

19 files changed, with 748 additions and 1066 deletions.

.github/workflows/tests.yml

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+name: SCALAR Tagger CI
+
+on:
+  push:
+    branches: [ main, develop ]
+  pull_request:
+    branches: [ main, develop ]
+
+jobs:
+  test-docker:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Pull pre-built image
+        run: docker pull sourceslicer/scalar_tagger:latest
+
+      - name: Start container
+        run: |
+          docker run -d -p 8080:8080 sourceslicer/scalar_tagger:latest
+
+      - name: Wait for service to start
+        run: |
+          # Wait for up to 10 minutes for the service to start
+          timeout=600
+          while [ $timeout -gt 0 ]; do
+            if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
+              echo "Service is ready"
+              break
+            fi
+            echo "Waiting for service to start... ($timeout seconds remaining)"
+            sleep 5
+            timeout=$((timeout - 5))
+          done
+
+          if [ $timeout -le 0 ]; then
+            echo "Service failed to start within timeout"
+            docker logs $(docker ps -q)
+            exit 1
+          fi
+
+      - name: Test tagger endpoint
+        run: |
+          response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
+          if [ -z "$response" ]; then
+            echo "No response from tagger"
+            exit 1
+          fi
+          echo "Received response: $response"
+
+  test-native:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+
+      - name: Create and activate virtual environment
+        run: |
+          python -m venv /tmp/tagger
+          source /tmp/tagger/bin/activate
+
+      - name: Install dependencies
+        run: |
+          pip install -r requirements.txt
+
+      - name: Download FastText model
+        run: |
+          python -c "
+          import gensim.downloader as api
+          print('Downloading FastText model...')
+          _ = api.load('fasttext-wiki-news-subwords-300')
+          print('FastText model downloaded successfully')
+          "
+
+      - name: Start tagger server
+        run: |
+          ./main -r &
+
+          # Wait for up to 5 minutes for the service to start and load models
+          timeout=300
+          while [ $timeout -gt 0 ]; do
+            if curl -s "http://localhost:8080/cache/numberArray/DECLARATION" > /dev/null; then
+              echo "Service is ready"
+              break
+            fi
+            echo "Waiting for service to start... ($timeout seconds remaining)"
+            sleep 10
+            timeout=$((timeout - 10))
+          done
+
+          if [ $timeout -le 0 ]; then
+            echo "Service failed to start within timeout"
+            # Print logs or debug information
+            cat logs/*.log 2>/dev/null || true
+            exit 1
+          fi
+
+      - name: Test tagger endpoint
+        run: |
+          response=$(curl -s "http://localhost:8080/cache/numberArray/DECLARATION")
+          if [ -z "$response" ]; then
+            echo "No response from tagger"
+            exit 1
+          fi
+          echo "Received response: $response"
+
+      - name: Cache FastText model
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/gensim-data/fasttext-wiki-news-subwords-300*
+          key: ${{ runner.os }}-fasttext-model

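The two "Wait for service to start" loops in the workflow above poll the tagger endpoint until it answers or a time budget runs out. A minimal Python sketch of the same pattern follows. It is not part of this commit, and the names `http_probe` and `wait_until_ready` are hypothetical. Like the shell loops, it counts the budget down by the sleep interval rather than measuring wall-clock time.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if `url` produces any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # The server answered with a non-2xx status. `curl -s` without
        # `--fail` also exits 0 here, so treat it as "up".
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset, or timed out: not up yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or the budget is used up."""
    remaining = timeout_s
    while remaining > 0:
        if probe():
            return True
        print(f"Waiting for service to start... ({remaining:g} seconds remaining)")
        sleep(interval_s)
        remaining -= interval_s
    return False
```

For the Docker job this could be called as `wait_until_ready(lambda: http_probe("http://localhost:8080/cache/numberArray/DECLARATION"))`, exiting non-zero when it returns `False`. The `sleep` parameter exists only so the loop can be exercised without real delays.
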
Dockerfile

Lines changed: 15 additions & 12 deletions
@@ -1,4 +1,4 @@
-FROM python:3.10-slim
+FROM python:3.12-slim
 
 # Install (and build) requirements
 COPY requirements.txt /requirements.txt
@@ -7,16 +7,22 @@ RUN apt-get update && \
     pip install -r requirements.txt && \
     rm -rf /var/lib/apt/lists/*
 
+COPY . .
+RUN pip install -e .
+
+# Download FastText model during build
+RUN python3 -c "import gensim.downloader as api; api.load('fasttext-wiki-news-subwords-300')"
+
 # ntlk downloads
 RUN python3 -c "import nltk; nltk.download('averaged_perceptron_tagger');nltk.download('universal_tagset')"
 
-# Pythong scripts and data
-COPY classifier_multiclass.py \
-     download_code2vec_vectors.py \
-     feature_generator.py \
-     print_utility_functions.py \
-     tag_identifier.py \
-     create_models.py \
+# Python scripts and data
+COPY src/classifier_multiclass.py \
+     src/download_code2vec_vectors.py \
+     src/feature_generator.py \
+     src/tag_identifier.py \
+     src/create_models.py \
+     version.py \
      serve.json \
      main \
      /.
@@ -62,10 +68,7 @@ CMD date; \
     echo "Failed to retrieve Last-Modified headers"; \
     fi; \
     date; \
-    echo "Training..."; \
-    /main -t; \
-    date; \
     echo "Running..."; \
     /main -r --words words/abbreviationList.csv
 
-ENV TZ=US/Michigan
+ENV TZ=US/Michigan

README.md

Lines changed: 66 additions & 8 deletions
@@ -1,27 +1,36 @@
 # SCALAR Part-of-speech tagger
 This the official release of the SCALAR Part-of-speech tagger
 
+There are two ways to run the tagger. This document describes both ways.
+
+1. Using Docker compose (which runs the tagger's built-in server for you)
+2. Running the tagger's built-in server without Docker
+
 ## Getting Started with Docker
 
 To run SCNL tagger in a Docker container you can clone the repository and pull the latest docker impage from `srcml/scanl_tagger:latest`
 
+Make sure you have Docker and Docker Compose installed:
+https://docs.docker.com/engine/install/
+https://docs.docker.com/compose/install/
+
 ```
-git clone https://github.com/brandonscholten/scanl_tagger.git
+git clone git@github.com:SCANL/scanl_tagger.git
 cd scanl_tagger
 docker compose pull
 docker compose up
 ```
 
-## Setup and Run
-You will need `python3.10` installed.
+## Getting Started without Docker
+You will need `python3.12` installed.
 
 You'll need to install `pip` -- https://pip.pypa.io/en/stable/installation/
 
-After it's installed, in the root of the repo, run `pip install -r requirements.txt`
+Set up a virtual environtment: `python -m venv /tmp/tagger` -- feel free to put it somewhere else (change /tmp/tagger) if you prefer
 
-Finally, you need to install Spiral, which we use for identifier splitting. The current version of Spiral on the official repo has a [problem](https://github.com/casics/spiral/issues/4), so consider installing the one from the link below:
+Activate the virtual environment: `source /tmp/tagger/bin/activate` (you can find how to activate it here if `source` does not work for you -- https://docs.python.org/3/library/venv.html#how-venvs-work)
 
-pip install git+https://github.com/cnewman/spiral.git
+After it's installed and your virtual environment is activated, in the root of the repo, run `pip install -r requirements.txt`
 
 Finally, we require the `token` and `target` vectors from [code2vec](https://github.com/tech-srl/code2vec). The tagger will attempt to automatically download them if it doesn't find them, but you could download them yourself if you like. It will place them in your local directory under `./code2vec/*`
 
@@ -47,6 +56,8 @@ options:
 
 http://127.0.0.1:5000/{cache_selection}/{identifier_name}/{code_context}
 
+**NOTE: ** On docker, the port is 8080 instead of 5000.
+
 "cache selection" will save results to a separate cache if it is set to "student"
 
 "code context" is one of:
@@ -69,15 +80,62 @@ Kebab case is not currently supported due to the limitations of Spiral. Attempti
 
 You will need to have a way to parse code and filter out identifier names if you want to do some on-the-fly analysis of source code. We recommend [srcML](https://www.srcml.org/). Since the actual tagger is a web server, you don't have to use srcML. You could always use other AST-based code representations, or any other method of obtaining identifier information.
 
+
+## Tagset
+
+**Supported Tagset**
+| Abbreviation | Expanded Form | Examples |
+|:------------:|:--------------------------------------------:|:--------------------------------------------:|
+| N | noun | Disneyland, shoe, faucet, mother |
+| DT | determiner | the, this, that, these, those, which |
+| CJ | conjunction | and, for, nor, but, or, yet, so |
+| P | preposition | behind, in front of, at, under, above |
+| NPL | noun plural | Streets, cities, cars, people, lists |
+| NM | noun modifier (**noun-adjunct**, adjective) | red, cold, hot, **bit**Set, **employee**Name |
+| V | verb | Run, jump, spin, |
+| VM | verb modifier (adverb) | Very, loudly, seriously, impatiently |
+| D | digit | 1, 2, 10, 4.12, 0xAF |
+| PRE | preamble | Gimp, GLEW, GL, G, p, m, b |
+
+**Penn Treebank to SCALAR tagset**
+
+| Penn Treebank Annotation | SCALAR Tagset |
+|:---------------------------:|:------------------------:|
+| Conjunction (CC) | Conjunction (CJ) |
+| Digit (CD) | Digit (D) |
+| Determiner (DT) | Determiner (DT) |
+| Foreign Word (FW) | Noun (N) |
+| Preposition (IN) | Preposition (P) |
+| Adjective (JJ) | Noun Modifier (NM) |
+| Comparative Adjective (JJR) | Noun Modifier (NM) |
+| Superlative Adjective (JJS) | Noun Modifier (NM) |
+| List Item (LS) | Noun (N) |
+| Modal (MD) | Verb (V) |
+| Noun Singular (NN) | Noun (N) |
+| Proper Noun (NNP) | Noun (N) |
+| Proper Noun Plural (NNPS) | Noun Plural (NPL) |
+| Noun Plural (NNS) | Noun Plural (NPL) |
+| Adverb (RB) | Verb Modifier (VM) |
+| Comparative Adverb (RBR) | Verb Modifier (VM) |
+| Particle (RP) | Verb Modifier (VM) |
+| Symbol (SYM) | Noun (N) |
+| To Preposition (TO) | Preposition (P) |
+| Verb (VB) | Verb (V) |
+| Verb (VBD) | Verb (V) |
+| Verb (VBG) | Verb (V) |
+| Verb (VBN) | Verb (V) |
+| Verb (VBP) | Verb (V) |
+| Verb (VBZ) | Verb (V) |
+
 ## Training the tagger
 You can train this tagger using the `-t` option (which will re-run the training routine). For the moment, most of this is hard-coded in, so if you want to use a different data set/different seeds, you'll need to modify the code. This will potentially change in the future.
 
 ## Errors?
 Please make an issue if you run into errors
 
-# Please Cite the Paper!
+# Please Cite the Paper(s)!
 
-No paper for now however the current tagger is based on our previous, so you could cite the previous one for now:
+Newman, Christian, Scholten , Brandon, Testa, Sophia, Behler, Joshua, Banabilah, Syreen, Collard, Michael L., Decker, Michael, Mkaouer, Mohamed Wiem, Zampieri, Marcos, Alomar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I., (2025), “SCALAR: A Part-of-speech Tagger for Identifiers”, in the Proceedings of the 33rd IEEE/ACM International Conference on Program Comprehension - Tool Demonstrations Track (ICPC), Ottawa, ON, Canada, April 27 -28, 5 pages TO APPEAR.
 
 Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, and Emily Hill, "An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3098242.

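The README changes above describe the request format `http://127.0.0.1:5000/{cache_selection}/{identifier_name}/{code_context}`, note that the port is 8080 under Docker, and add a Penn Treebank to SCALAR tag table. The sketch below shows one way a client might build such a URL and hold the table as a lookup. It is not part of this commit: `build_tagger_url` and `PENN_TO_SCALAR` are hypothetical names, and the default `cache_selection="cache"` is an assumption taken from the URL used in the CI workflow.

```python
from urllib.parse import quote

# Penn Treebank tag -> SCALAR tag, transcribed from the README table.
PENN_TO_SCALAR = {
    "CC": "CJ", "CD": "D", "DT": "DT", "FW": "N", "IN": "P",
    "JJ": "NM", "JJR": "NM", "JJS": "NM", "LS": "N", "MD": "V",
    "NN": "N", "NNP": "N", "NNPS": "NPL", "NNS": "NPL",
    "RB": "VM", "RBR": "VM", "RP": "VM", "SYM": "N", "TO": "P",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
}


def build_tagger_url(
    identifier_name: str,
    code_context: str,
    cache_selection: str = "cache",  # assumption: value seen in the CI URL
    host: str = "127.0.0.1",
    port: int = 5000,  # the README says to use 8080 when running under Docker
) -> str:
    """Build a request URL in the form the README documents."""
    parts = (cache_selection, identifier_name, code_context)
    # Percent-encode each path segment so unusual characters stay in place.
    path = "/".join(quote(part, safe="") for part in parts)
    return f"http://{host}:{port}/{path}"
```

With `host="localhost"` and `port=8080`, `build_tagger_url("numberArray", "DECLARATION")` reproduces the URL that the workflow polls with `curl`.
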
__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+from .version import __version__, __version_info__
