
Commit 4bdc9e4

Initial commit
1 parent c8cc51f commit 4bdc9e4

File tree

86 files changed, +11993 -3 lines changed


.gitattributes

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+inference_results/dotprompts_results.csv filter=lfs diff=lfs merge=lfs -text
+inference_results/dotprompts_results_sample.csv filter=lfs diff=lfs merge=lfs -text
+inference_results/ filter=lfs diff=lfs merge=lfs -text

CITATION.cff

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Monitor-Guided Decoding of Code LMs with Static Analysis
  of Repository Context
message: >-
  If you use this repository, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Lakshya A
    family-names: Agrawal
    affiliation: Microsoft Research
    orcid: 'https://orcid.org/0000-0003-0409-8212'
  - given-names: Aditya
    family-names: Kanade
    affiliation: Microsoft Research
  - given-names: Navin
    family-names: Goyal
    affiliation: Microsoft Research
  - given-names: Shuvendu K.
    family-names: Lahiri
    affiliation: Microsoft Research
  - given-names: Sriram K.
    family-names: Rajamani
    affiliation: Microsoft Research
identifiers:
  - type: doi
    value: 10.48550/arXiv.2306.10763
  - type: url
    value: >-
      https://openreview.net/forum?id=qPUbKxKvXq&noteId=98Ukj82fSP
abstract: >-
  Language models of code (LMs) work well when the
  surrounding code provides sufficient context. This is not
  true when it becomes necessary to use types, functionality
  or APIs defined elsewhere in the repository or a linked
  library, especially those not seen during training. LMs
  suffer from limited awareness of such global context and
  end up hallucinating.

  Integrated development environments (IDEs) assist
  developers in understanding repository context using
  static analysis. We extend this assistance, enjoyed by
  developers, to LMs. We propose monitor-guided decoding
  (MGD) where a monitor uses static analysis to guide the
  decoding. We construct a repository-level dataset
  PragmaticCode for method-completion in Java and evaluate
  MGD on it. On models of varying parameter scale, by
  monitoring for type-consistent object dereferences, MGD
  consistently improves compilation rates and agreement with
  ground truth. Further, LMs with fewer parameters, when
  augmented with MGD, can outperform larger LMs. With MGD,
  SantaCoder-1.1B achieves better compilation rate and
  next-identifier match than the much larger
  text-davinci-003 model.

  We also conduct a generalizability study to evaluate the
  ability of MGD to generalize to multiple programming
  languages (Java, C# and Rust), coding scenarios (e.g.,
  correct number of arguments to method calls), and to
  enforce richer semantic constraints (e.g., stateful API
  protocols). Our data and implementation are available at
  https://github.com/microsoft/monitors4codegen.
keywords:
  - program analysis
  - correctness
  - code generation
  - Language models

CODE_OF_CONDUCT.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [[email protected]](mailto:[email protected]) with questions or concerns

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

RAI_Transparency_Information.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Responsible AI Transparency Information

## What is Monitor-Guided Decoding (MGD)?

Monitor-Guided Decoding (MGD) is a tool that helps Language Models (LMs) generate more reliable code. It combines token-by-token LM decoding with program analysis techniques (methods that can check the syntax, semantics, and logic of code, such as those used in Integrated Development Environments). Under MGD, a software component called a monitor runs concurrently with the decoder and iteratively uses the results of continuous program analysis to prevent the generation of potentially problematic tokens, such as identifiers that are inconsistent with the type definitions. For example, a type analysis is performed at identifier dereferences to find the set of type-correct symbols and prevent the generation of type-invalid symbols, thus producing code free from a large class of compilation errors.
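The masking step described above can be sketched as follows. This is a minimal illustration, not the actual monitors4codegen implementation: the function names and the prefix-based token filter are hypothetical, and a real monitor works over sub-token vocabularies and richer analysis results.

```python
# Illustrative sketch of one monitor-guided decoding step: mask tokens the
# static analysis deems type-invalid, then pick the best remaining token.
# All names here (mgd_decode_step, allowed_identifiers, ...) are hypothetical.
import math

def mgd_decode_step(logits, vocab, allowed_identifiers, at_dereference):
    """Return the index of the highest-scoring token permitted by the monitor.

    logits: one score per vocabulary entry.
    vocab: the token strings.
    allowed_identifiers: type-correct symbols reported by static analysis.
    at_dereference: True when the cursor sits right after '.', so the next
    token must begin a type-correct member name.
    """
    best_idx, best_score = None, -math.inf
    for i, (score, tok) in enumerate(zip(logits, vocab)):
        if at_dereference:
            # Keep only tokens that are a prefix of some allowed symbol, so
            # decoding can still assemble identifiers from sub-tokens.
            if not any(sym.startswith(tok.strip()) for sym in allowed_identifiers):
                continue
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: the LM prefers the hallucinated member 'lenght', but the
# monitor only allows members that actually exist on the dereferenced type.
vocab = ["lenght", "length", "size"]
logits = [2.0, 1.5, 1.0]
allowed = {"length"}  # from static analysis of the dereferenced type
print(vocab[mgd_decode_step(logits, vocab, allowed, at_dereference=True)])  # prints "length"
```

Without the monitor the highest-logit token ("lenght") would be emitted; with it, the hallucinated identifier is masked out before selection.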
The static analysis in MGD is powered by Language Servers communicating over the Language Server Protocol. MGD takes as input a code repository, a partially completed code file within the repository, and a prompt for the LM to generate the remaining code; it then uses a Language Model (from HuggingFace or OpenAI) to provide a code completion while adhering to the monitored property.
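For concreteness, a monitor can query a language server with a `textDocument/completion` request. The method name, 0-based positions, and `Content-Length` framing below follow the Language Server Protocol specification; the file URI and position are made-up example values.

```python
# Sketch of the kind of LSP request a monitor could send to a language
# server. LSP messages are JSON-RPC 2.0 bodies preceded by a
# Content-Length header measured in bytes.
import json

def lsp_message(method, params, msg_id=1):
    """Frame a JSON-RPC 2.0 request with the LSP Content-Length header."""
    body = json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params})
    return f"Content-Length: {len(body.encode('utf-8'))}\r\n\r\n{body}"

# Ask for type-correct completions right after a '.' dereference
# (line/character are 0-based per the LSP specification; the URI is an
# example value, not a real file).
msg = lsp_message("textDocument/completion", {
    "textDocument": {"uri": "file:///example/Repo/src/Main.java"},
    "position": {"line": 41, "character": 17},
})
print(msg.splitlines()[0])  # the Content-Length header line
```

The server's response lists the members valid at that position, which is exactly the "set of type-correct symbols" the monitor uses for masking.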
## What can Monitor-Guided Decoding do?

MGD can improve the quality and reliability of code generated by LMs, especially when the code uses types or functionality defined in another module or library, or when the LM has not seen such types or functionality during training (for example, when a library version has been upgraded with new APIs, or in private codebases). MGD can also prevent the LM from hallucinating non-existent dereferenced identifiers. Since MGD is prompt-agnostic, it can be used for various code generation tasks, such as code writing, code repair, code refactoring, and code completion, simply by changing the prompt. MGD can also be applied to any programming language for which a Language Server that declares the "textDocument/completion" capability is available.
## What is/are Monitor-Guided Decoding's intended use(s)?

MGD is intended to be used as a research tool to advance the state of the art in, and explore the potential of, combining LM decoding with program analysis for code generation. It is also intended to serve as a baseline for evaluating and improving the performance of LMs on code generation tasks. It can be integrated into IDEs with LM-based code-completion assistants; however, this use case has not been evaluated with users. MGD is not intended to be a substitute for human verification or testing of the generated code, and it does not guarantee that the generated code is bug-free.
## How was Monitor-Guided Decoding evaluated? What metrics are used to measure performance?

MGD was evaluated on PragmaticCode, a dataset of open-source Java repositories from GitHub containing code with different levels of complexity and context. The dataset was used to curate a code benchmark called DotPrompts (consisting of >10,000 test cases), whose prompts require the LM to generate the remaining code for a partially completed nontrivial method. The benchmark is set up such that the LM must generate non-local identifier dereferences to complete the method.

MGD was applied to several off-the-shelf LMs of different sizes and domains, such as CodeGen-{350M, 2B, 6B}-Multi, SantaCoder-1.1B, and OpenAI text-davinci-003. The performance of LMs with and without MGD was measured using the following metrics:

1. Compilation Rate: fraction of test cases for which the generated code compiled successfully
2. Next Identifier Match: fraction of test cases for which the generated next identifier is accurate
3. Identifier Sequence Match: percent prefix of the ordered identifiers in the ground truth matched by the generated code
4. Prefix Match: percent prefix of the ground truth matched by the generated code
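Metrics 2 and 3 above can be sketched as follows. This is illustrative only, under the assumption that identifiers can be extracted with a simple regex; the real evaluation would use a proper Java lexer.

```python
# Illustrative sketch of Next Identifier Match and Identifier Sequence
# Match. Identifier extraction is simplified to a regex for brevity.
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def identifiers(code):
    """Return the ordered identifiers appearing in a code string."""
    return IDENT.findall(code)

def next_identifier_match(generated, ground_truth):
    """1 if the first generated identifier matches the ground truth, else 0."""
    gen, ref = identifiers(generated), identifiers(ground_truth)
    return int(bool(gen and ref) and gen[0] == ref[0])

def identifier_sequence_match(generated, ground_truth):
    """Fraction of the ground truth's identifier sequence matched as a prefix."""
    gen, ref = identifiers(generated), identifiers(ground_truth)
    if not ref:
        return 1.0
    matched = 0
    for g, r in zip(gen, ref):
        if g != r:
            break
        matched += 1
    return matched / len(ref)

print(next_identifier_match("list.size();", "list.size();"))          # matches
print(identifier_sequence_match("list.size(); x", "list.size(); y"))  # 2 of 3 matched
```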
The metrics were aggregated over 6 independent trials for each test case using the following aggregation:

* score@k: an estimate of the best score achievable by the evaluated model, given k independent trials.
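One standard way to compute such an estimate, assuming n >= k recorded trials per test case, is the exact expected maximum score over a random size-k subset of the n trials; with 0/1 scores this reduces to the well-known pass@k estimator. The formula below is a generic combinatorial sketch, not necessarily the exact estimator used in the paper:

```python
# score@k sketch: expected best score over a random k-subset of the n
# recorded trial scores, computed exactly via order statistics.
# With 0/1 scores this reduces to the familiar pass@k estimator.
from math import comb

def score_at_k(scores, k):
    n = len(scores)
    assert 1 <= k <= n
    s = sorted(scores)  # ascending: s[j-1] is the j-th smallest score
    # P(the max of a random k-subset is the j-th order statistic)
    #   = C(j-1, k-1) / C(n, k)
    total = sum(s[j - 1] * comb(j - 1, k - 1) for j in range(1, n + 1))
    return total / comb(n, k)

print(score_at_k([0, 0, 1, 1], 2))  # average best-of-2 over all 6 pairs: 5/6
print(score_at_k([0.5, 1.0], 1))    # score@1 is just the mean
```

With 6 trials per test case, score@k can thus be reported for any k from 1 to 6 without re-running the model.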
The results show that MGD consistently improved the ability of the LMs to generate code that compiles and matches the ground truth, across different metrics and models, and outperformed the prompting technique on most metrics. MGD also demonstrated that LMs with fewer parameters, when guided with MGD, can outperform larger LMs without MGD.
## What are the limitations of Monitor-Guided Decoding? How can users minimize the impact of Monitor-Guided Decoding's limitations when using the system?

MGD has some limitations that users should be aware of when using the system. Some of these limitations are:

* The current instantiation of MGD monitors for type-consistent use of identifiers, which addresses one of the major sources of compilation errors in LM-based code generation. However, there are other kinds of errors and bugs that MGD does not monitor or prevent, such as logical, syntactic, semantic, or runtime errors. Users should not rely on MGD to generate error-free code and should always verify and test the generated code for correctness and functionality.
* MGD relies on the availability and accuracy of a Language Server for the programming language of interest. If the Language Server is not available, not compatible, or not reliable, MGD cannot be applied or may produce incorrect results. Users should ensure that the Language Server used is suitable and trustworthy.
* MGD introduces some latency overhead to the code generation process, as it requires invoking the language server and masking the LM output iteratively. In our experiments we found the latency overhead to be insignificant; however, it may vary depending on the complexity of the code repository, the size of the LM, the speed of the static analysis, and the hardware and software configuration of the system.
* MGD is a research tool that has not been extensively tested or validated with human users. It may not generalize well to domains and tasks beyond the scope of its evaluation.
## What operational factors and settings allow for effective and responsible use of Monitor-Guided Decoding?

MGD has been shown to enhance the output of the LM by preventing a class of errors from appearing in the generated code. However, the generated code is still limited by the capability of the base LM.

Some of the operational factors and settings that enable effective and responsible use of MGD are:

* Choosing an appropriate LM for the code generation task and the programming language of interest. Users should select an LM that has been trained on a relevant and diverse corpus of code. Users should also be aware of the limitations and assumptions of the LM and how they may affect the quality and reliability of the generated code.
* Reviewing and testing the generated code for correctness and functionality. Users should not blindly trust or use the generated code without verifying and testing it for errors, bugs, or vulnerabilities. Users should also document and acknowledge the use of MGD and the LM for their code generation task and cite the relevant sources and references.
