-
Notifications
You must be signed in to change notification settings - Fork 0
1 delete orphan files #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
||
// Remove orphans. | ||
OrphanFileScanner orphanFileScanner = new OrphanFileScanner(table); | ||
orphanFileScanner.removeOrphanedFiles(olderThanMillis, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may want a shorter ttl here
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response; | ||
|
||
public class OrphanFileScanner { | ||
private static final Logger LOG = LoggerFactory.getLogger(OrphanFileScanner.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
called logger
in the rest of the codebase
String bucket = location.replace("s3://", "").split("/")[0]; | ||
String prefix = location.replace("s3://" + bucket + "/", ""); | ||
|
||
S3Client s3 = S3.newClient(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
com.altinity.ice.cli.internal.* shouldn't be used here; true indicates "anonymous" access -> won't work;
can we use table io here (so that impl would be not tied to s3)?
deletedFiles.forEach(f -> LOG.info("Deleted: {}", f)); | ||
} | ||
|
||
executor.shutdownNow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already called shutdown on line 120
return allFiles; | ||
} | ||
|
||
public void removeOrphanedFiles(long olderThanMillis, boolean dryRun) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on current impl this method should be executed only when warehouse is s3 (not s3 tables, etc)
|
||
// Set<String> orphanedFiles = new HashSet<>(); | ||
|
||
allFiles.remove(knownFiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this supposed to be removeAll?
} | ||
} | ||
|
||
return knownFiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result looks incomplete. Inserting a single file into empty table yields
./metadata/00000-f62c927f-3965-42e9-8f3a-57809ca49f55.metadata.json
./metadata/4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1-m0.avro
./metadata/snap-7186194791324723249-1-4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1.avro
./metadata/00001-07fe7fa7-8fe1-4666-bd7b-7c88b9c861fe.metadata.json
./data/pickup_datetime_month=2009-01/1750715797525-2d861d7a4e5196bf04e5c3be506b0152f4170cf4e941424c8245b46903c3e252-part.parquet
i.e. we're going to nuke a bunch of critical files and break the catalog...
knownFiles.add(dataFile.path().toString()); | ||
} | ||
} catch (Exception e) { | ||
logger.error("Error getting list of data files", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
network error here leads to us deleting files from catalog that are not supposed to be deleted
…eve snapshot metadata JSON files for orphan file deletion.
Tested by dropping a new file
|
return; | ||
} | ||
|
||
if (dryRun) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dryRun should probably a config variable, so they can try before toggling it?
|
||
if (orphanFileExpirationDays == 0) { | ||
logger.info( | ||
"Skipping orphan file removal for table {} since orphanFileExpirationDays config was not set", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this can be avoided for every table if we have a separate iteration, not sure if this is the best option.
closes:#1