Open
Labels
helps: rapids, topic: hyperloglog, type: bug
Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug (https://github.com/NVIDIA/cuCollections/issues)
Type of Bug
Silent Failure
Describe the bug
With a standard deviation of 0.3, Spark's approx_count_distinct estimates 81 distinct values for the int values below, while cuCo's hyperloglog estimates 80.
Details:
pyspark reproduce steps:
$SPARK_HOME/bin/pyspark
from pyspark.sql.functions import approx_count_distinct
df = spark.createDataFrame([434971005, -1801141102, 1963272577, -493001830, -1087762159, 843441079, 959409252, 252071729, -1830233271, 820808802, -1535782039, 1531475465, 1642188005, 552222160, -194998970, 2109544455, 1405026214, 1672131131, 1247840828, -180033177, -1286780806, 933672832, 1401381638, -241603026, 615622263, -957425136, -276735314, -2009711680, -639722582, 974221725, 713012837, -1402812678, -546850329, -866141232, 848946484, -635203849, -1450175774, 844979905, 888971584, 1855780699, -1268565561, -1185513673, 1019479409, -1333229875, -1246182436, -2147483648, 900525526, 1006079044, -698588704, -943987698, 27695788, -84695147, -1441291062, 397673504, -392707402, 1290858625, 1420750585, -1178564290, 1921246226, 188935376, 6560145, -1928347973, 820364161, -401706971, -1118924186, 1759421546, -1350108963, 2097517825, -23883470, -1221269093, 1264159503, 97097882, 982791723, 638708040, -349593807, 361658100, 341780548, -4171545, 1095633384, -1694321873, 1777502952, -1699998259, -1432813716, 1113816192, -966808405, 1583478695, -650293396, 35500231, -440874147, 995739986, 207692068, 0, -1243401007, -1576220155, 1868986580, -87141217, 2108694405, -251958436, 2028975576, 1725957984, -354115601, 888726314, 1032487345, -1968749299, 1880817790, 1113480821, 789387254, -1724956749, -1201901245], "INT")
df.agg(approx_count_distinct("value", 0.3).alias('distinct_values')).show()
+---------------+
|distinct_values|
+---------------+
|             81|
+---------------+
C++ reproduce steps:
#include <cuco/hyperloglog.cuh>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cstddef>
#include <iostream>
int main() {
  using T = int;
thrust::host_vector<T> h_items = {434971005, -1801141102, 1963272577, -493001830, -1087762159, 843441079, 959409252, 252071729, -1830233271, 820808802, -1535782039, 1531475465, 1642188005, 552222160, -194998970, 2109544455, 1405026214, 1672131131, 1247840828, -180033177, -1286780806, 933672832, 1401381638, -241603026, 615622263, -957425136, -276735314, -2009711680, -639722582, 974221725, 713012837, -1402812678, -546850329, -866141232, 848946484, -635203849, -1450175774, 844979905, 888971584, 1855780699, -1268565561, -1185513673, 1019479409, -1333229875, -1246182436, -2147483648, 900525526, 1006079044, -698588704, -943987698, 27695788, -84695147, -1441291062, 397673504, -392707402, 1290858625, 1420750585, -1178564290, 1921246226, 188935376, 6560145, -1928347973, 820364161, -401706971, -1118924186, 1759421546, -1350108963, 2097517825, -23883470, -1221269093, 1264159503, 97097882, 982791723, 638708040, -349593807, 361658100, 341780548, -4171545, 1095633384, -1694321873, 1777502952, -1699998259, -1432813716, 1113816192, -966808405, 1583478695, -650293396, 35500231, -440874147, 995739986, 207692068, 0, -1243401007, -1576220155, 1868986580, -87141217, 2108694405, -251958436, 2028975576, 1725957984, -354115601, 888726314, 1032487345, -1968749299, 1880817790, 1113480821, 789387254, -1724956749, -1201901245};
  thrust::device_vector<T> items = h_items;
  auto const sd = cuco::standard_deviation{0.3};
  // Initialize the estimator
  cuco::hyperloglog<T> estimator{sd};
  // Add all items to the estimator
  estimator.add(items.begin(), items.end());
  // Calculate the cardinality estimate
  std::size_t const estimated_cardinality = estimator.estimate();
  std::cout << "Estimated cardinality: " << estimated_cardinality << std::endl;
}
// result is 80
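For context, both reproductions request a standard deviation of 0.3, which corresponds to a very small sketch. The snippet below is a minimal sketch of the textbook HyperLogLog relation between the relative standard deviation and the number of registers, rsd ≈ c / sqrt(m) with m = 2^p. The constant c = 1.04, the minimum precision of 4, and the helper names `hll_precision` and `register_count` are assumptions for illustration; the constants Spark and cuCo actually use are not stated in this report and may differ.

```python
import math


def hll_precision(rsd: float, constant: float = 1.04, min_precision: int = 4) -> int:
    """Smallest p with constant / sqrt(2**p) <= rsd, clamped to min_precision.

    `constant` and `min_precision` are assumed textbook defaults, not values
    taken from Spark or cuCo.
    """
    p = math.ceil(2.0 * math.log2(constant / rsd))
    return max(min_precision, p)


def register_count(precision: int) -> int:
    """Number of HLL registers for a given precision (m = 2**p)."""
    return 1 << precision


if __name__ == "__main__":
    p = hll_precision(0.3)
    print(f"precision={p}, registers={register_count(p)}")
```

Under these assumptions a standard deviation of 0.3 yields the minimum precision, so the estimate is derived from only a handful of registers, and small implementation differences between two libraries may be enough to shift the result by one.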
How to Reproduce
Please refer to the description above
Expected behavior
cuCo should return the same estimate as Spark (81) for this input.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response