
[BUG]: HLLPP in cuCo has different behavior with Spark #696

@res-life


Type of Bug

Silent Failure

Describe the bug

With the standard deviation set to 0.3, estimating the distinct count of the following int values returns 81 in Spark but 80 in cuCo.
Details:

PySpark reproduction steps:

$SPARK_HOME/bin/pyspark

from pyspark.sql.functions import approx_count_distinct
df = spark.createDataFrame([434971005, -1801141102, 1963272577, -493001830, -1087762159, 843441079, 959409252, 252071729, -1830233271, 820808802, -1535782039, 1531475465, 1642188005, 552222160, -194998970, 2109544455, 1405026214, 1672131131, 1247840828, -180033177, -1286780806, 933672832, 1401381638, -241603026, 615622263, -957425136, -276735314, -2009711680, -639722582, 974221725, 713012837, -1402812678, -546850329, -866141232, 848946484, -635203849, -1450175774, 844979905, 888971584, 1855780699, -1268565561, -1185513673, 1019479409, -1333229875, -1246182436, -2147483648, 900525526, 1006079044, -698588704, -943987698, 27695788, -84695147, -1441291062, 397673504, -392707402, 1290858625, 1420750585, -1178564290, 1921246226, 188935376, 6560145, -1928347973, 820364161, -401706971, -1118924186, 1759421546, -1350108963, 2097517825, -23883470, -1221269093, 1264159503, 97097882, 982791723, 638708040, -349593807, 361658100, 341780548, -4171545, 1095633384, -1694321873, 1777502952, -1699998259, -1432813716, 1113816192, -966808405, 1583478695, -650293396, 35500231, -440874147, 995739986, 207692068, 0, -1243401007, -1576220155, 1868986580, -87141217, 2108694405, -251958436, 2028975576, 1725957984, -354115601, 888726314, 1032487345, -1968749299, 1880817790, 1113480821, 789387254, -1724956749, -1201901245], "INT")
df.agg(approx_count_distinct("value", 0.3).alias('distinct_values')).show()

+---------------+
|distinct_values|
+---------------+
|             81|
+---------------+
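For context on how coarse a sketch this deviation implies, here is a minimal Python sketch of the precision calculation. It assumes Spark derives the HLL++ precision `p` from the relative standard deviation as `ceil(2 * log2(1.106 / rsd))`; that formula and the helper name `hllpp_precision` are assumptions for illustration and are not taken from this report.

```python
import math

def hllpp_precision(rsd: float) -> int:
    """Assumed Spark-style mapping from relative standard deviation to precision p."""
    return math.ceil(2.0 * math.log(1.106 / rsd) / math.log(2.0))

p = hllpp_precision(0.3)   # precision for the deviation used in this report
registers = 1 << p         # an HLL sketch with precision p has 2^p registers
print(p, registers)
```

Under that assumption the sketch has very few registers, so a small difference in register contents or in the final estimation step could plausibly move the result by one.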

C++ reproduction steps:

#include <cuco/hyperloglog.cuh>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cstddef>
#include <iostream>

int main() {

  using T = int;

  thrust::host_vector<T> h_items = {434971005, -1801141102, 1963272577, -493001830, -1087762159, 843441079, 959409252, 252071729, -1830233271, 820808802, -1535782039, 1531475465, 1642188005, 552222160, -194998970, 2109544455, 1405026214, 1672131131, 1247840828, -180033177, -1286780806, 933672832, 1401381638, -241603026, 615622263, -957425136, -276735314, -2009711680, -639722582, 974221725, 713012837, -1402812678, -546850329, -866141232, 848946484, -635203849, -1450175774, 844979905, 888971584, 1855780699, -1268565561, -1185513673, 1019479409, -1333229875, -1246182436, -2147483648, 900525526, 1006079044, -698588704, -943987698, 27695788, -84695147, -1441291062, 397673504, -392707402, 1290858625, 1420750585, -1178564290, 1921246226, 188935376, 6560145, -1928347973, 820364161, -401706971, -1118924186, 1759421546, -1350108963, 2097517825, -23883470, -1221269093, 1264159503, 97097882, 982791723, 638708040, -349593807, 361658100, 341780548, -4171545, 1095633384, -1694321873, 1777502952, -1699998259, -1432813716, 1113816192, -966808405, 1583478695, -650293396, 35500231, -440874147, 995739986, 207692068, 0, -1243401007, -1576220155, 1868986580, -87141217, 2108694405, -251958436, 2028975576, 1725957984, -354115601, 888726314, 1032487345, -1968749299, 1880817790, 1113480821, 789387254, -1724956749, -1201901245};
  thrust::device_vector<T> items = h_items;
  auto const sd = cuco::standard_deviation{0.3};

  // Initialize the estimator
  cuco::hyperloglog<T> estimator{sd};

  // Add all items to the estimator
  estimator.add(items.begin(), items.end());

  // Calculate the cardinality estimate
  std::size_t const estimated_cardinality = estimator.estimate();
  std::cout << "Estimated cardinality: " << estimated_cardinality << std::endl;
}

// result is 80
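To quantify the gap between the two results, the short Python sketch below computes their relative difference. The variable names are illustrative. The comparison is made against Spark's estimate, not against the true distinct count, which this report does not state.

```python
spark_estimate = 81   # result reported for Spark
cuco_estimate = 80    # result reported for cuCo
rsd = 0.3             # standard deviation passed to both estimators

# Relative difference of the cuCo result with respect to the Spark result
relative_gap = abs(spark_estimate - cuco_estimate) / spark_estimate
print(f"{relative_gap:.4f}")
```

The requested deviation is a statistical target, not a hard bound, so this number only describes the size of the mismatch. It does not indicate which result is closer to the true count.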

How to Reproduce

Please refer to the description above

Expected behavior

cuCo should return the same estimate as Spark for the same input and deviation.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response
