Skip to content

Conversation

@fxf8
Copy link

@fxf8 fxf8 commented Jan 25, 2024

Despite the fact that AnnoyIndex had methods to save and load the state of an index to a file, there was not any method to serialize the state of the index without saving to a file. This is important because you may want to use an AnnoyIndex object as a field in another class, and also be able to efficiently serialize the class that contains the index. The main benefit of this is that some serialization libraries may be able to serialize the aggregate class more efficiently in terms of size. This also avoids other complications such as file write permissions and searching for an adequate temp directiry and makes it easier to serialize the aggregate overall.

This pull request includes the following:

  • AnnoyIndex serialize and deserialize functions. The serialize method returns bytes which can be passed into the deserialize method to then 'load' the bytes to create the original serialized index.
  • Bindings for serialize and deserialize functions implemented for python, golang, and lua
  • Tests for the functionality and bindings in python, golang, and lua

So far, I have gotten all tests to pass in addition to the tests which I have implemented in each language. The tests that I have implemented simply test the result of get_nns_by_item before and after serializing and deserializing an index. Unless there is anything to change that I have overseen, it should be ready.

@erikbern
Copy link
Collaborator

Thanks for the contribution! Since Annoy stores all data in contiguous memory, wouldn't it be enough to just copy that memory buffer and then load from that? I'm not sure if we need to reimplement the logic to find the roots.

Also what happens if you serialize an index that has mmapped a file? We should probably disallow that right?

@fxf8
Copy link
Author

fxf8 commented Jan 28, 2024

@erikbern I have changed some of the code based on your suggestion to instead store the roots and other information in the serialized bytes so it does not need to be recalculated. I tested if serialization will still work properly even if the index has been loaded from a file by mmapping, and it seems to still work just fine based on the fact that the serialization result is the same. The main logic that is being implemented here is putting together the buffers that are pointed to within the AnnoyIndex class all into a single contiguous byte sequence which is later turned back into an index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants