feat: add `ml/strided/dkmeans-init-plus-plus` #12312
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: passed
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
| * Initializes centroids by performing the k-means++ initialization procedure. | ||
| * | ||
| * ## Method | ||
| * | ||
| * The k-means++ algorithm for choosing initial centroids is as follows: | ||
| * | ||
| * 1. Select a data point uniformly at random from a data set \\( X \\). This data point is the first centroid and is denoted \\( c_0 \\). | ||
| * | ||
| * 2. Compute the distance from each data point to \\( c_0 \\). Denote the distance between \\( c_j \\) and data point \\( m \\) as \\( d(x_m, c_j) \\). | ||
| * | ||
| * 3. Select the next centroid, \\( c_1 \\), at random from \\( X \\) with probability | ||
| * | ||
| * ```tex | ||
| * \frac{d^2(x_m, c_0)}{\sum_{j=0}^{n-1} d^2(x_j, c_0)} | ||
| * ``` | ||
| * | ||
| * where \\( n \\) is the number of data points. | ||
| * | ||
| * 4. To choose centroid \\( j \\), | ||
| * | ||
| * a. Compute the distances from each data point to each centroid and assign each data point to its closest centroid. | ||
| * | ||
| * b. For \\( i = 0,\ldots,n-1 \\) and \\( p = 0,\ldots,j-2 \\), select centroid \\( j \\) at random from \\( X \\) with probability | ||
| * | ||
| * ```tex | ||
| * \frac{d^2(x_i, c_p)}{\sum_{\{h; x_h \in C_p\}} d^2(x_h, c_p)} | ||
| * ``` | ||
| * | ||
| * where \\( C_p \\) is the set of all data points closest to centroid \\( c_p \\) and \\( x_i \\) belongs to \\( C_p \\). | ||
| * | ||
| * Stated more plainly, select each subsequent centroid with a probability proportional to the squared distance from the data point to the closest centroid already chosen. | ||
| * | ||
| * 5. Repeat step `4` until \\( k \\) centroids have been chosen. | ||
| * | ||
| * ## References | ||
| * | ||
| * - Arthur, David, and Sergei Vassilvitskii. 2007. "K-means++: The Advantages of Careful Seeding." In _Proceedings of the Eighteenth Annual Acm-Siam Symposium on Discrete Algorithms_, 1027–35. SODA '07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. <http://dl.acm.org/citation.cfm?id=1283383.1283494>. | ||
| * |
I am not sure which files should include this description. Please advise.
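For anyone following the thread, step 4 of the doc comment above can be sketched in isolation. This is only a minimal illustration, assuming squared Euclidean distance and data stored as nested arrays. `closestSquaredDistances` and `normalize` are hypothetical helpers written for this note, not functions from this PR or from `ml/incr/kmeans`.

```javascript
'use strict';

// Hypothetical helper: for each data point, computes the squared Euclidean distance to its closest centroid.
function closestSquaredDistances( X, centroids ) {
	var out = new Float64Array( X.length );
	var best;
	var d2;
	var t;
	var i;
	var j;
	var p;
	for ( i = 0; i < X.length; i++ ) {
		best = Infinity;
		for ( j = 0; j < centroids.length; j++ ) {
			d2 = 0.0;
			for ( p = 0; p < X[ i ].length; p++ ) {
				t = X[ i ][ p ] - centroids[ j ][ p ];
				d2 += t * t;
			}
			if ( d2 < best ) {
				best = d2;
			}
		}
		out[ i ] = best;
	}
	return out;
}

// Hypothetical helper: normalizes weights in-place so that they sum to one.
function normalize( w ) {
	var sum = 0.0;
	var i;
	for ( i = 0; i < w.length; i++ ) {
		sum += w[ i ];
	}
	for ( i = 0; i < w.length; i++ ) {
		w[ i ] /= sum;
	}
	return w;
}

// Three points on a line, with the first point already chosen as `c_0`:
var X = [ [ 0.0, 0.0 ], [ 1.0, 0.0 ], [ 3.0, 0.0 ] ];
var probs = normalize( closestSquaredDistances( X, [ X[ 0 ] ] ) );
// probs => <Float64Array>[ 0.0, 0.1, 0.9 ]
```

The point already chosen as a centroid receives zero probability, and the farthest point dominates because the weights are squared distances (0, 1, and 9 here).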
| * ]); | ||
| * | ||
| * var v = dkmeansInitPlusPlus( 'row-major', k, M, N, out, 2, xbuf, 2, 'sqeuclidean', 3, 44 ); | ||
| * // returns <Float64Array>[0,0,1,-1,1,1] |
I hope this return value makes sense. Since the user already has the strides, accessing individual centroids should be straightforward. The user is also expected to pass in an initialized, empty `out` array, which is filled with the flat centroids array.
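To make the access pattern concrete, here is a small sketch of reading one centroid back out of the flat array. It assumes a row-major layout in which each centroid occupies `N` contiguous elements and consecutive centroids are `LDO` elements apart (my reading of the sixth argument in the example above, which may not match the final API). `centroidAt` is a hypothetical helper, not part of this PR.

```javascript
'use strict';

// Hypothetical helper: returns a view of centroid `i` in a flat row-major array.
// Assumes rows are `LDO` elements apart and each row holds `N` features.
function centroidAt( out, LDO, N, i ) {
	var offset = i * LDO;
	return out.subarray( offset, offset + N );
}

// Values copied from the doc example's return annotation:
var out = new Float64Array( [ 0.0, 0.0, 1.0, -1.0, 1.0, 1.0 ] );

var c1 = centroidAt( out, 2, 2, 1 );
// c1 => <Float64Array>[ 1.0, -1.0 ]
```

Because `subarray` returns a view rather than a copy, writes through `c1` would also modify `out`.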
| // Create a scratch array for storing cumulative probabilities: | ||
| probs = new Float64Array( M ); | ||
|
|
||
| // 2-5. For each data point, compute the distances to each centroid, find the closest centroid, and, based on the distance to the closest centroid, assign a probability to the data point to be chosen as centroid `c_j`... |
Please note that all of the comments in this implementation are taken directly from `ml/incr/kmeans/lib/init_kmeansplusplus`.
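For context on the cumulative-probability scratch array, the selection step can be sketched as an inverse-CDF lookup. This is a simplified illustration, not the code in this PR: `cumulative` and `sample` are hypothetical helpers, and the uniform draw `u` is passed in explicitly so the behavior is deterministic (in practice it would come from a seeded PRNG).

```javascript
'use strict';

// Hypothetical helper: converts probabilities into cumulative probabilities.
function cumulative( probs ) {
	var out = new Float64Array( probs.length );
	var sum = 0.0;
	var i;
	for ( i = 0; i < probs.length; i++ ) {
		sum += probs[ i ];
		out[ i ] = sum;
	}
	return out;
}

// Hypothetical helper: returns the first index whose cumulative probability exceeds `u`, where `u` is a draw from [0,1).
function sample( cdf, u ) {
	var i;
	for ( i = 0; i < cdf.length-1; i++ ) {
		if ( u < cdf[ i ] ) {
			return i;
		}
	}
	// Fall back to the last index in case round-off leaves the final entry slightly below one:
	return cdf.length - 1;
}

var cdf = cumulative( [ 0.0, 0.1, 0.9 ] );

var i1 = sample( cdf, 0.05 );
// i1 => 1

var i2 = sample( cdf, 0.5 );
// i2 => 2
```

A data point with zero probability (index `0` here) can never be selected, since no draw from [0,1) falls below a cumulative value of zero.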
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: na
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: passed
- task: lint_javascript_benchmarks
status: na
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
---
type: pre_commit_static_analysis_report
description: Results of running static analysis checks when committing changes.
report:
- task: lint_filenames
status: passed
- task: lint_editorconfig
status: passed
- task: lint_markdown_pkg_readmes
status: na
- task: lint_markdown_docs
status: na
- task: lint_markdown
status: na
- task: lint_package_json
status: na
- task: lint_repl_help
status: na
- task: lint_javascript_src
status: na
- task: lint_javascript_cli
status: na
- task: lint_javascript_examples
status: na
- task: lint_javascript_tests
status: na
- task: lint_javascript_benchmarks
status: passed
- task: lint_python
status: na
- task: lint_r
status: na
- task: lint_c_src
status: na
- task: lint_c_examples
status: na
- task: lint_c_benchmarks
status: na
- task: lint_c_tests_fixtures
status: na
- task: lint_shell
status: na
- task: lint_typescript_declarations
status: passed
- task: lint_typescript_tests
status: na
- task: lint_license_headers
status: passed
---
/stdlib merge
| if ( trials < 1 ) { | ||
| throw new TypeError( format( 'invalid argument. Thirteenth argument must be a valid trials (>=1). Value: `%s`.', trials ) ); | ||
| } |
Would it be better to just return NaN?
| if ( k < 1 || M < 1 || N < 1 ) { | ||
| return NaN; | ||
| } |
Is this check necessary?
- Since this is a low-level kernel, we can expect the user to pass in valid parameters.
- However, without this check, the API returns `<Float64Array>[ NaN, NaN, NaN, ..., NaN ]` for invalid parameters.
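As a possible explanation for the all-`NaN` output, here is a sketch of two ways `NaN` can arise when the data set is empty. I have not traced the PR's code path, so treat this as an assumption about the mechanism rather than a description of the implementation.

```javascript
'use strict';

// An empty data set (as if `M = 0`):
var x = new Float64Array( 0 );
var out = new Float64Array( 2 );

// An out-of-bounds read on a typed array yields `undefined`, which is stored as `NaN`:
out[ 0 ] = x[ 0 ];
// out[ 0 ] => NaN

// Normalizing by a zero sum of squared distances also yields `NaN`:
var sum = 0.0;
var p = 0.0 / sum;
// p => NaN
```

Either path would silently propagate `NaN` into the output, which is an argument for keeping an explicit early return (or documenting the behavior) even in a low-level kernel.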
Resolves None.
Description
This pull request:
- adds `ml/strided/dkmeans-init-plus-plus`.
Related Issues
This pull request has the following related issues:
Questions
No.
Other
No.
Checklist
AI Assistance
If you answered "yes" above, how did you use AI assistance?
Disclosure
I used Claude Code to compare the implementation with `ml/incr/kmeans/lib/init_kmeansplusplus` and existing strided BLAS implementations, but I authored the proposed changes manually.
@stdlib-js/reviewers