Possible issue with eland.DataFrame.value_counts(): some values are missing from the counts #643

@mumuwithw

I tried using eland to read data from two data streams with es_index_pattern=["*java.backend*", "*h3c*"]. The field 'data_stream.dataset' holds the name of the data stream each document belongs to; in this example its values are 'h3c' and 'java.backend'.
When I print the dataframe with 'df', I can see 'h3c' rows in the output, but when I call value_counts() on this field, only 'java.backend' appears. I'm not sure whether this is a bug, because I saw a warning about this field when creating the eland.DataFrame.

The code and output are as follows:

>>> import eland as ed
>>> from elasticsearch import Elasticsearch
>>> import pandas as pd
>>> escli = Elasticsearch(
...         hosts="https://******",
...         basic_auth=("elastic", "***"),
...         ca_certs='./http_ca.crt',
...     )
>>> df = ed.DataFrame(
...     escli,
...     es_index_pattern=["*java.backend*", "*h3c*"],
...     columns=['@timestamp', 'message', 'data_stream.dataset'],
...     es_index_field='@timestamp'
...     )

# here is the warning mentioned before
......
xxxx\lib\site-packages\eland\field_mappings.py:327: UserWarning: Field data_stream.dataset has conflicting types ('constant_keyword', None) != text
......
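
The warning suggests the field is mapped differently in the two sets of indices. A minimal sketch of how that could be checked, assuming an elasticsearch-py 8.x client such as `escli` above: it calls the client's `indices.get_field_mapping` API and collects the mapped type per backing index. The helper name `field_types_by_index` is hypothetical and not part of eland or elasticsearch-py.

```python
def field_types_by_index(client, index_pattern, field):
    """Return {index_name: mapped type of `field`} for every matching index.

    Hypothetical helper: `client` is assumed to be an
    elasticsearch.Elasticsearch instance (8.x style keyword arguments).
    """
    resp = client.indices.get_field_mapping(fields=field, index=index_pattern)
    types = {}
    for index_name in resp:
        # Each entry looks like {"mappings": {field: {"mapping": {leaf: {...}}}}};
        # an index that does not map the field has an empty "mappings" dict.
        entry = resp[index_name]["mappings"].get(field)
        if entry is None:
            types[index_name] = None
            continue
        # The inner key is the leaf name ("dataset" for "data_stream.dataset").
        leaf = next(iter(entry["mapping"].values()), {})
        types[index_name] = leaf.get("type")
    return types
```

For example, `field_types_by_index(escli, ["*java.backend*", "*h3c*"], "data_stream.dataset")` would list which backing indices report `constant_keyword`, which report `text`, and which do not map the field at all.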




# here 'data_stream.dataset' contains both 'h3c' and 'java.backend'
>>> df
                                                     @timestamp  ... data_stream.dataset
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
...                                                         ...  ...                 ...
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:38:46.967Z       2023-12-19 07:38:46.967000+00:00  ...        java.backend

[42240705 rows x 3 columns]



# but here value_counts() only returns 'java.backend'
>>> df['data_stream.dataset'].value_counts()
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(10) 
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(2)  
java.backend    42043023
Name: data_stream.dataset, dtype: int64
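
Note that the dataframe reports 42240705 rows while value_counts() accounts for 42043023, which leaves 197682 rows uncounted. One way to cross-check this outside eland is to run a terms aggregation directly, on the assumption that value_counts() is backed by such an aggregation. The sketch below assumes an elasticsearch-py 8.x client; `terms_counts` and `uncounted` are hypothetical helper names. It also reports failed shards, since a search that fails on some shards can still return partial results.

```python
def terms_counts(client, index_pattern, field, size=10):
    """Run a terms aggregation on `field` and summarise the response.

    Hypothetical helper: `client` is assumed to be an
    elasticsearch.Elasticsearch instance (8.x style keyword arguments).
    """
    resp = client.search(
        index=index_pattern,
        size=0,                  # no hits needed, only the aggregation
        track_total_hits=True,   # exact total instead of the 10,000 cap
        aggs={"counts": {"terms": {"field": field, "size": size}}},
    )
    buckets = resp["aggregations"]["counts"]["buckets"]
    return {
        "total_hits": resp["hits"]["total"]["value"],
        "failed_shards": resp["_shards"]["failed"],
        "counts": {b["key"]: b["doc_count"] for b in buckets},
    }


def uncounted(summary):
    """Documents matched by the search but absent from the returned buckets."""
    return summary["total_hits"] - sum(summary["counts"].values())
```

Calling `terms_counts(escli, ["*java.backend*", "*h3c*"], "data_stream.dataset")` and then `uncounted(...)` on the result would show whether Elasticsearch itself omits the 'h3c' documents from the aggregation. Keep in mind that the buckets cover only the top `size` terms and that terms aggregation counts can be approximate.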

Labels: topic:dataframe (Issue or PR about eland.DataFrame)