This repository was archived by the owner on Jan 9, 2020. It is now read-only.
HDFS access umbrella issue #128
Open
Description
We believe that accessing existing (remote) HDFS systems will be a common source of input data and destination for output data. We want to support and test this mode of usage.
Some issues to pay attention to:
- HDFS version support. Since one is supposed to use client libraries that correspond with the HDFS server version, this affects both our testing and packaging.
- NameNode address support. Since it's expected that many Spark apps will use the same HDFS (this is common usage), can we provide a way to specify the NameNode in a default configuration so that each app config doesn't have to repeat it? (Perhaps putting
spark.hadoop.fs.defaultFS=hdfs://<host>:<port>
in spark-defaults.conf is the right solution? See the sketch after this list.)
- Identity support. How will users identify their username to HDFS, and how will this be configurable? (For example,
export HADOOP_USER_NAME=<username>
in conf/spark-env.sh; see the sketch after this list.)
- Kerberos support. As an extension to the basic identity support, Kerberos is a commonly used mechanism to authenticate the identified user. When will we say we do or don't support Kerberos for these remote clusters?
- Non-support of HDFS locality. Since we generally expect that these clusters are remote to the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.
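Taken together, a minimal sketch of the NameNode and identity defaults discussed above might look like the following. The host, port, and username are hypothetical placeholders, and this assumes the remote HDFS uses simple (non-Kerberos) authentication:

```
# conf/spark-defaults.conf
# spark.hadoop.* properties are copied into each app's Hadoop Configuration,
# so every app defaults to the shared HDFS without naming it itself.
spark.hadoop.fs.defaultFS=hdfs://namenode.example.com:8020

# conf/spark-env.sh
# Identify the HDFS user when the remote cluster uses simple authentication.
export HADOOP_USER_NAME=alice
```

With defaults like these in place, an application should be able to refer to paths such as hdfs:///user/alice/input without repeating the NameNode address or username in its own configuration.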
Other concerns?