Replies: 1 comment
The simple way to do it is just to create two generators, each with its own cache, but both referencing the same model. They should work independently:

```python
generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

generator_1.begin_stream_ex(...)
generator_2.begin_stream_ex(...)

while True:
    res_1 = generator_1.stream_ex()
    res_2 = generator_2.stream_ex()
    ...
```

Just note that even though the model is stateless, it's not thread-safe. I plan to replace the generator/cache system with a more versatile paged attention scheme soon.
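For context, here is a fuller, self-contained sketch of the same pattern. The generator calls (`ExLlamaV2StreamingGenerator`, `begin_stream_ex`, `stream_ex`) are the ones from the reply above; the loading and sampling setup (`ExLlamaV2Config`, `ExLlamaV2`, `ExLlamaV2Cache`, `ExLlamaV2Tokenizer`, `ExLlamaV2Sampler`) follows the usual exllamav2 boilerplate and may need adjusting to your installed version. The model path and prompts are placeholders.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

# Load the model weights once.
config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)

# One cache per independent generation stream; each holds its own KV state,
# but both point at the same shared weights.
cache_1 = ExLlamaV2Cache(model)
cache_2 = ExLlamaV2Cache(model)

generator_1 = ExLlamaV2StreamingGenerator(model, cache_1, tokenizer)
generator_2 = ExLlamaV2StreamingGenerator(model, cache_2, tokenizer)

settings = ExLlamaV2Sampler.Settings()

# Start two unrelated generations against the same model.
generator_1.begin_stream_ex(tokenizer.encode("Main-line prompt"), settings)
generator_2.begin_stream_ex(tokenizer.encode("Verifier prompt"), settings)

# Interleave decoding steps from a single loop. The model is stateless but
# not thread-safe, so keep all calls on one thread.
text_1, text_2 = "", ""
done_1 = done_2 = False
while not (done_1 and done_2):
    if not done_1:
        res_1 = generator_1.stream_ex()
        text_1 += res_1["chunk"]
        done_1 = res_1["eos"]
    if not done_2:
        res_2 = generator_2.stream_ex()
        text_2 += res_2["chunk"]
        done_2 = res_2["eos"]
```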
Hi there!
I'm looking for a way to share model weights among two or more generators/caches.
The reason for this:
I want to keep one cache for my "main line" iterative generations and have other caches for auxiliary generations (mainly agent/verifier tasks). Of course I could use batching instead, but that would hurt performance because of the fixed batch size, even when only some of the slots are in use (or am I getting something wrong here?).
Thanks!