Skip to content

Conversation

wwwjn
Copy link
Contributor

@wwwjn wwwjn commented Apr 18, 2023

Reasons for making this change

#4370 does not have synchronization between threads (upload file thread and generate index thread), so it does not work for large files like imagenet (158 GB).

Related issues

Screenshots

Checklist

  • I've added a screenshot of the changes, if this is a frontend change
  • I've added and/or updated tests, if this is a backend change
  • I've run the pre-commit.sh script
  • I've updated docs, if needed

Copy link
Member

@epicfaace epicfaace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove the .pyc files? Also let's add them to .gitignore

account_name=AZURE_BLOB_ACCOUNT_NAME,
container_name=AZURE_BLOB_CONTAINER_NAME,
account_key=AZURE_BLOB_ACCOUNT_KEY,
expiry=datetime.datetime.now() + datetime.timedelta(hours=1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we refactor this into a constant? Also, is 10 hours enough or might we need more for even larger files?

self._current_max_buf_length = len(self._bufs[0])
for i in range(1, self.NUM_READERS):
self._current_max_buf_length = max(self._current_max_buf_length, len(self._bufs[i]))
# print(f"find largest buffer: {self._current_max_buf_length} in thread: {threading.current_thread().name}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comments

"""
NUM_READERS = 2

# MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean? Can you add a description of what MAX_BUF_SIZE is used for?

NUM_READERS = 2

# MAX memory usage <= MAX_BUF_SIZE + max(num_bytes called in read)
MAX_BUF_SIZE = 1024 * 1024 * 1024 # 10 MiB for test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value should it be for non-test?

self.find_largest_buffer()

def find_largest_buffer(self):
self._current_max_buf_length = len(self._bufs[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring comment


def check_file_target_contents(self, target):
"""Checks to make sure that the specified file has the contents 'hello world'."""
# This can not be checked, Since
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants