Unable to use Zstd compression with parquet  #764

@kartik18

Description

Hi Everyone,

I'm using the cp-kafka-connect-base:7.50 docker image with the kafka-connect-s3:10.5.13 plugin installed inside it. We can write data in Parquet format from a Kafka topic to S3.
However, when I set the configuration parquet.codec: zstd to write the data compressed, the task fails with the following stack trace:
"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:618)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:336)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:237)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:206)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)\n\tat org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.RuntimeException: native zStandard library not available: this version of libhadoop was built without zstd support.\n\tat org.apache.hadoop.io.compress.ZStandardCodec.checkNativeCodeLoaded(ZStandardCodec.java:65)\n\tat org.apache.hadoop.io.compress.ZStandardCodec.getCompressorType(ZStandardCodec.java:153)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)\n\tat org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)\n\tat org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:144)\n\tat org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)\n\tat org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)\n\tat org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:287)\n\tat org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)\n\tat io.confluent.connect.s3.format.parquet.ParquetRecordWriterProvider$1.write(ParquetRecordWriterProvider.java:102)\n\tat io.confluent.connect.s3.format.S3RetriableRecordWriter.write(S3RetriableRecordWriter.java:51)\n\tat io.confluent.connect.s3.format.KeyValueHeaderRecordWriterProvider$1.write(KeyValueHeaderRecordWriterProvider.java:114)\n\tat io.confluent.connect.s3.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:592)\n\tat io.confluent.connect.s3.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:327)\n\tat io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:267)\n\tat io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:218)\n\tat io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:244)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:587)\n\t...

I see there is already an earlier issue about this: #570

Questions:

  1. Looking at the pom.xml file, Hadoop version 3.3.6 is used, and its libhadoop.so.1.0.0 is supposed to be zstd compatible already. Why, then, are we facing this issue with the latest version?
  2. Even after replicating the solution mentioned by one of the users (github-louis-fruleux), I'm still facing the same issue.
    P.S. What I did was:
  • Downloaded https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Extracted it and copied libhadoop.so.1.0.0 into the docker image
  • Built the image with the following Dockerfile:

```dockerfile
FROM confluentinc/cp-kafka-connect-base:7.50

ENV CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.5.13

COPY libhadoop.so.1.0.0 /usr/lib64/libhadoop.so.1.0.0
COPY scripts/libhadoop.so.1.0.0 /usr/lib64/libhadoop.so
RUN ls -ltr /usr/lib64/

ENV KAFKA_OPTS="${KAFKA_OPTS} -Djava.library.path=/usr/lib64/"
RUN echo "KAFKA_OPTS value: $KAFKA_OPTS"

CMD ["/etc/confluent/docker/run"]
```
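To narrow down whether the copied library is actually being picked up, I put together a small diagnostic along these lines (a sketch, not part of the original setup; it assumes the Hadoop jars shipped with the connector plugin are on the classpath, and the class name CheckZstd is just a placeholder):

```java
import org.apache.hadoop.util.NativeCodeLoader;

// Prints whether libhadoop was found on java.library.path and, if so,
// whether that particular build was compiled with zstd bindings.
// This is the same check that fails in the stack trace above.
public class CheckZstd {
    public static void main(String[] args) {
        System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));

        boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("libhadoop loaded: " + loaded);

        if (loaded) {
            // Calling buildSupportsZstd() without libhadoop loaded would
            // throw UnsatisfiedLinkError, hence the guard above.
            System.out.println("build supports zstd: "
                    + NativeCodeLoader.buildSupportsZstd());
        }
    }
}
```

If it prints `libhadoop loaded: false`, the library is not being found on `java.library.path` (or, if I understand the native build correctly, a runtime dependency such as libzstd.so.1 is missing from the image); if it loads but reports no zstd support, the copied binary itself was built without zstd, which is exactly what the exception says.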

How should I fix this issue?
