# Contributing to the Databricks Labs Data Generator

While **dbldatagen** cannot accept direct contributions from external contributors, all users can create GitHub Issues to propose new functionality. The dbldatagen team will review and prioritize new features based on user feedback.

## Making a contribution

### Setup
To set up your local environment:

1. Ensure any [Non-Python Dependencies](#non-python-dependencies) are installed locally.
2. Clone the repository:
   ```bash
   git clone "repository URL"
   ```

3. Open the repository in your IDE. Run the following terminal command to create a local development environment:
   ```bash
   make dev
   ```
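
For reference, a first-time setup typically looks like the following sketch. The URL assumes you are cloning the public repository directly; substitute your fork's URL if you work from a fork.

```bash
# Clone the repository and enter it
git clone https://github.com/databrickslabs/dbldatagen.git
cd dbldatagen

# Create the local development environment
make dev
```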

### Development
When contributing new functionality:

1. Sync changes from the `master` branch:
   ```bash
   git checkout master && git pull
   ```
2. Check out a new branch from `master`:
   ```bash
   git checkout -b "branch name"
   ```
3. Add your functionality, tests, documentation, and examples.

### Formatting
dbldatagen aims to follow [PEP8 standards](https://peps.python.org/pep-0008/). Code style should be checked for any new commits.

To validate code locally:

1. Run the following terminal command in your IDE:
   ```bash
   make fmt
   ```
2. Fix any issues until no messages remain.
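
If `make fmt` reports problems it cannot fix automatically, correct them by hand and re-run the check. As a rough sketch, assuming ruff is the configured formatter and linter (as referenced in earlier versions of this guide), the loop looks like:

```bash
# Auto-format and apply safe fixes
make fmt

# Optionally re-check without modifying files to see what remains
ruff check .
```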

### Testing
dbldatagen aims to have the highest possible test coverage. Code should be tested for any new commits.

To run unit tests locally:

1. Run the following terminal command in your IDE:
   ```bash
   make test-coverage
   ```
2. Verify that all tests pass.
3. Open the coverage report in your browser.
4. Verify that all modified modules have full coverage.
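
The exact location of the coverage report depends on the coverage configuration; a common default is an `htmlcov/` directory in the project root. Assuming that layout, the workflow is roughly:

```bash
# Run the full suite with coverage enabled
make test-coverage

# Open the HTML report (check the command output for the actual path)
open htmlcov/index.html       # macOS
xdg-open htmlcov/index.html   # Linux
```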

### Submitting a PR
To submit a pull request:

1. Squash all local commits in your branch (see the sketch after this list).
2. Push your changes:
   ```bash
   git push -u origin "branch name"
   ```
3. Navigate to the [Pull Requests](https://github.com/databrickslabs/dbldatagen/pulls) page and click **New pull request**.
4. Complete the template.
5. Submit your PR.
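
Squashing can be done in several ways; one common approach is an interactive rebase onto `master`. The branch name below is hypothetical.

```bash
# Replay the branch onto master, marking all but the first commit as "squash" or "fixup"
git checkout feat_my_new_feature
git rebase -i master

# If the branch was already pushed before squashing, update the remote safely
git push --force-with-lease
```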

## Building the project locally

### Building the HTML documentation
Documentation can be reviewed locally. To build and open the documentation in your browser, run the following terminal command:
```bash
make docs-serve
```

### Building the Python wheel
dbldatagen can be built locally as a Python wheel. To build the wheel, run the following terminal command:

```bash
make build
```
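
Modern Python build backends write the wheel to a `dist/` directory by default; assuming that layout, you can inspect and install the built artifact into a scratch environment as follows (the exact filename depends on the version being built):

```bash
# Inspect the build output
ls dist/

# Install the freshly built wheel into the active environment
pip install dist/dbldatagen-*.whl
```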

## Prerequisites

### Python Compatibility
dbldatagen supports Python 3.10 and later, and is tested against those versions.

### Development Tools
All development tools are configured in `pyproject.toml`.

### Python Dependencies
All Python dependencies are defined in `pyproject.toml`:

1. `[project.dependencies]` lists dependencies installed with the `dbldatagen` library
2. `[tool.hatch.envs.default]` lists the default environment necessary to develop, test, and build the `dbldatagen` library

### Non-Python Dependencies
dbldatagen is tested against Databricks Runtime version 13.3 LTS and OpenJDK 17.

Spark and Java dependencies are not installed automatically by the build process and should be installed manually to develop and run dbldatagen locally.
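
Before running `make dev`, it is worth confirming that compatible Java and Python versions are visible on your `PATH`. The earlier version of this guide also recommended pointing PySpark at a specific interpreter when multiple Python versions are installed:

```bash
# Confirm the JDK available to Spark (OpenJDK 17, per the prerequisites above)
java -version

# Confirm the Python interpreter backing the dev environment (3.10+)
python --version

# Optional: pin PySpark to a specific interpreter if several are installed
export PYSPARK_PYTHON=$(which python3.10)
export PYSPARK_DRIVER_PYTHON=$(which python3.10)
```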

## Development standards

### Code style
All code should adhere to the following standards:

1. **Formatted and linted** to PEP8 standards.
2. **Type-validated** using [mypy](https://mypy-lang.org/).
3. **Clearly named** variables, classes, and methods.
4. **Documented** with docstrings that detail functionality and usage.
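
Type checks can be run locally as well. Assuming mypy is available in the dev environment created by `make dev` (there may also be a dedicated make target; check the Makefile), a direct invocation looks like:

```bash
# Run the type checker over the package sources
mypy dbldatagen
```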

### Testing
All tests should use [pytest](https://docs.pytest.org/en/stable/) with fixtures and parameterization where appropriate. This includes:

1. **Unit tests**, which cover functionality that does not require a Databricks workspace and should always be preferred to integration tests when possible.
2. **Integration tests**, which cover functionality that requires Databricks compute, Unity Catalog, or other workspace features.
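
While `make test-coverage` runs the full suite, it is usually faster to iterate on a single module or test while developing. Assuming the tests live in a `tests/` directory (check the repository layout), pytest can be invoked directly; the file name below is hypothetical:

```bash
# Run a single test module
pytest tests/test_my_new_feature.py

# Run only tests whose names match an expression
pytest tests/ -k "weights"
```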

### Branches
All local development should branch from `master` and adhere to the following naming convention:

1. `feat_<feature_name>` for new functionality
2. `fix_<issue_number>_<fix_name>` for bugfixes
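
For example, the following commands create convention-compliant branches; the feature name and issue number are hypothetical:

```bash
# New feature branch
git checkout master && git pull
git checkout -b feat_weighted_date_ranges

# Bugfix branch for issue #123
git checkout master && git pull
git checkout -b fix_123_null_handling
```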

### Pull requests
All pull requests should adhere to the following standards:

1. Pull requests should be scoped to a single repository issue.
2. Local commits should be squashed on your branch before opening a pull request.
3. All pull requests should include functionality, tests, documentation, and examples.

## License
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project's Databricks license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's Databricks license and warrant that you have the legal authority to do so.