Skip to content

1 delete orphan files #35

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

1 delete orphan files #35

wants to merge 8 commits into from

Conversation

subkanthi
Copy link
Collaborator

@subkanthi subkanthi commented Jun 23, 2025

closes:#1

@subkanthi subkanthi linked an issue Jun 23, 2025 that may be closed by this pull request

// Remove orphans.
OrphanFileScanner orphanFileScanner = new OrphanFileScanner(table);
orphanFileScanner.removeOrphanedFiles(olderThanMillis, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want a shorter ttl here

import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;

public class OrphanFileScanner {
private static final Logger LOG = LoggerFactory.getLogger(OrphanFileScanner.class);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called logger in the rest of the codebase

String bucket = location.replace("s3://", "").split("/")[0];
String prefix = location.replace("s3://" + bucket + "/", "");

S3Client s3 = S3.newClient(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

com.altinity.ice.cli.internal.* shouldn't be used here; true indicates "anonymous" access -> won't work;
can we use table io here (so that impl would be not tied to s3)?

deletedFiles.forEach(f -> LOG.info("Deleted: {}", f));
}

executor.shutdownNow();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already called shutdown on line 120

return allFiles;
}

public void removeOrphanedFiles(long olderThanMillis, boolean dryRun) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on current impl this method should be executed only when warehouse is s3 (not s3 tables, etc)


// Set<String> orphanedFiles = new HashSet<>();

allFiles.remove(knownFiles);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this supposed to be removeAll?

}
}

return knownFiles;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result looks incomplete. Inserting a single file into empty table yields

./metadata/00000-f62c927f-3965-42e9-8f3a-57809ca49f55.metadata.json
./metadata/4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1-m0.avro
./metadata/snap-7186194791324723249-1-4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1.avro
./metadata/00001-07fe7fa7-8fe1-4666-bd7b-7c88b9c861fe.metadata.json
./data/pickup_datetime_month=2009-01/1750715797525-2d861d7a4e5196bf04e5c3be506b0152f4170cf4e941424c8245b46903c3e252-part.parquet

i.e. we're going to nuke a bunch of critical files and break the catalog...

knownFiles.add(dataFile.path().toString());
}
} catch (Exception e) {
logger.error("Error getting list of data files", e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

network error here leads to us deleting files from catalog that are not supposed to be deleted

@subkanthi
Copy link
Collaborator Author

image

@subkanthi
Copy link
Collaborator Author

Tested by dropping a new file

2025-07-03 19:17:03 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Looking for Orphaned files in location s3://bucket1/flowers/iris
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Found 1 orphaned files at s3://bucket1/flowers/iris!
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Deleted 1 orphaned files at s3://bucket1/flowers/iris!
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Deleted: s3://bucket1/flowers/iris/metadata/restore.sql

@subkanthi subkanthi marked this pull request as ready for review July 4, 2025 00:22
@subkanthi subkanthi requested a review from shyiko July 4, 2025 00:22
return;
}

if (dryRun) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dryRun should probably a config variable, so they can try before toggling it?


if (orphanFileExpirationDays == 0) {
logger.info(
"Skipping orphan file removal for table {} since orphanFileExpirationDays config was not set",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be avoided for every table if we have a separate iteration, not sure if this is the best option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ice-rest-catalog: Automate catalog maintenance
2 participants