This repository was archived by the owner on Jan 9, 2020. It is now read-only.
HDFS access umbrella issue #128
Open
Description
We believe that accessing existing (remote) HDFS systems will be a common source of input data and destination for output data. We want to support and test this mode of usage.
Some issues to pay attention to:
- HDFS version support. Since one is supposed to use client libraries that correspond with the HDFS server version, this affects both our testing and packaging.
- NameNode address support. Since it's expected that many Spark apps will use the same HDFS (this is common usage), can we provide a way to specify the NameNode in a default configuration so that each app config doesn't have to repeat it? (Perhaps putting
spark.hadoop.fs.defaultFS=hdfs://<host>:<port>
in spark-defaults.conf is the right solution? See the sketch after this list.)
- Identity support. How will users identify their username to HDFS, and how will this be configurable? (For example,
export HADOOP_USER_NAME=<username>
in conf/spark-env.sh; see the sketch after this list.)
- Kerberos support. As an extension to the basic identity support, Kerberos is a commonly used mechanism to authenticate the identified user. When will we say we do or don't support Kerberos for these remote clusters?
- Non-support of HDFS locality. Since we generally expect that these clusters are remote to the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.
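Taken together, a minimal sketch of the NameNode and identity defaults discussed above might look like the following. The host, port, and username are hypothetical placeholders, and this assumes the remote HDFS uses simple (non-Kerberos) authentication:

```
# conf/spark-defaults.conf
# spark.hadoop.* properties are copied into each app's Hadoop Configuration,
# so every app defaults to the shared HDFS without naming it itself.
spark.hadoop.fs.defaultFS=hdfs://namenode.example.com:8020

# conf/spark-env.sh
# Identify the HDFS user when the remote cluster uses simple authentication.
export HADOOP_USER_NAME=alice
```

With defaults like these in place, an application should be able to refer to paths such as hdfs:///user/alice/input without repeating the NameNode address or username in its own configuration.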
Other concerns?