[feature request] Make C-extension threadsafe/Ractor-safe #3283
Replies: 7 comments 6 replies
-
Libxml2 doesn't support concurrent modifications of the same document. See https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Thread-safety
-
The way Ractors work, only one of them can access a given document object at a time, so libxml2's lack of support for concurrent modifications of the same document shouldn't actually be an issue: https://ruby-doc.org/core-3.0.0/Ractor.html What I'm hoping to fix is that different document objects can't be accessed concurrently either, which is currently the case. According to the link you posted, libxml2 explicitly allows this as long as you meet the conditions listed there.
So I'm hoping this should actually be trivial! (I'm addressing only the Ractor use case here, since that's the only way this could happen in canonical, regular C-Ruby.)
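To make the "one document per Ractor" model concrete, here is a minimal sketch. It deliberately avoids Nokogiri (which currently raises inside a non-main Ractor) and uses plain Ruby Hashes as hypothetical stand-ins for documents. The `ractor_result` helper is an assumption added for portability, since newer Rubies use `Ractor#value` where older ones use `Ractor#take`.

```ruby
# Sketch only: each Ractor builds and mutates its own stand-in "document".
# Nothing mutable is shared, which is the access pattern discussed above.

# Hypothetical helper: fetch a Ractor's result on both older and newer Rubies.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Illustrative inputs, not real pages.
sources = ["<a>one</a>", "<b>two</b>", "<c>three</c>"]

ractors = sources.map do |markup|
  # The string is copied into the Ractor, so each one owns its own data.
  Ractor.new(markup) do |source|
    doc = { source: source, notes: [] }   # stand-in for a parsed document
    doc[:notes] << "length=#{source.length}"
    doc                                   # handed back to the main Ractor
  end
end

results = ractors.map { |r| ractor_result(r) }
results.each { |doc| puts "#{doc[:source]} -> #{doc[:notes].join(', ')}" }
```

Whether real Nokogiri documents can be used this way depends on the C-extension being marked Ractor-safe, which is the subject of this request.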
-
@eregon perhaps you or someone on the TruffleRuby team could lend a little more gravitas to my argument above? 😅
-
@mohamedhafez Thanks for opening this issue. Earlier this year I spent some time exploring how ractors and the sqlite3 gem interact, so I have questions. Have you tried parsing and manipulating documents in different ractors? What was your experience like? What worked and what didn't work? Our mental model is that although libxml2 doesn't support concurrent operations within a single document, each ractor should be able to parse and manipulate a separate document, and I'd like to update our mental model if your experience has been something different. When you say "support for ractors" I'm trying to understand your specific use case, and what specific error message motivated you to open this issue. Passing objects between ractors can be hard for complex object graphs, and so any additional information you can provide would help me form better mental models.
-
@flavorjones the mental model you mentioned, that each ractor should be able to parse and manipulate a separate document, is exactly what I'm hoping for. Currently, if you try to use Nokogiri in a Ractor, it fails with a Ractor::UnsafeError, as in this session:
~ $ curl 'https://nokogiri.org/tutorials/installing_nokogiri.html' > /tmp/installing_nokogiri.html
~ $ irb
3.3.3 :001 > require 'nokogiri'
=> true
3.3.3 :002 > Ractor.new { puts Nokogiri::HTML(File.open("/tmp/installing_nokogiri.html")).inspect }
(irb):2: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
=> #<Ractor:#2 (irb):2 blocking>
#<Thread:0x000000011f3d31d0 run> terminated with exception (report_on_exception is true):
/Users/mohamed/.rvm/gems/ruby-3.3.3/gems/nokogiri-1.16.6-arm64-darwin/lib/nokogiri/html4/document.rb:194:in `read_io': ractor unsafe method called from not main ractor (Ractor::UnsafeError)
from /Users/mohamed/.rvm/gems/ruby-3.3.3/gems/nokogiri-1.16.6-arm64-darwin/lib/nokogiri/html4/document.rb:194:in `parse'
from /Users/mohamed/.rvm/gems/ruby-3.3.3/gems/nokogiri-1.16.6-arm64-darwin/lib/nokogiri/html4.rb:11:in `HTML4'
from (irb):2:in `block in <top (required)>'
This is the expected behavior: https://docs.ruby-lang.org/en/3.3/extension_rdoc.html#label-Appendix+F.+Ractor+support. According to that doc, the fix basically boils down to protecting access to global variables with a Mutex, making sure any external libraries like libxml2 are safe to access from different threads, and then declaring the extension Ractor-safe as that appendix describes.
-
@flavorjones, following @eregon reporting in #3283 (reply in thread) that nokogiri already configures libxml to be multithreaded, I've been running my test workload on TruffleRuby with the C-extension lock turned off, with no issues as far as I can tell! I've got 10 threads cycling through 50 jobs (each job consists of downloading a webpage and processing it with Nokogiri to pick out a bunch of info from it). Then there's a 1 second pause, then I repeat. I've had that running for a couple of hours now and no problems!
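The shape of that workload can be sketched roughly as follows. This is not the actual test code: `process_page` is a hypothetical stand-in for "download a page and process it with Nokogiri", so the sketch runs without network access or the gem. The round count is also an assumption, kept small here.

```ruby
# Sketch only: 10 worker threads drain a queue of 50 jobs, pause, then repeat.

# Hypothetical stand-in for downloading a page and extracting info with Nokogiri.
def process_page(job_id)
  "job-#{job_id}".upcase
end

# Run one round: every job is processed exactly once by some worker thread.
def run_round(job_count:, thread_count:)
  jobs = Queue.new
  job_count.times { |i| jobs << i }
  jobs.close                      # after close, pop returns nil once the queue is empty

  results = Queue.new             # Queue is thread-safe, so workers can push freely
  workers = Array.new(thread_count) do
    Thread.new do
      while (job = jobs.pop)      # job ids are Integers, which are always truthy
        results << process_page(job)
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

ROUNDS = 2                        # assumption: the real run repeated for hours
ROUNDS.times do |round|
  processed = run_round(job_count: 50, thread_count: 10)
  puts "round #{round + 1}: processed #{processed.size} jobs"
  sleep 1 unless round == ROUNDS - 1   # 1 second pause between rounds
end
```

On CRuby the threads are serialized by the GVL while running Ruby code, so a workload like this only exercises true parallelism inside the extension on an implementation such as TruffleRuby with the C-extension lock disabled.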
-
Considering that making nokogiri Ractor-safe, or releasing the GVL before going into libxml2, would both require significant work that is a ways off, perhaps a more easily achievable goal would be getting to the point where we could declare
The reason this would be easier is that we wouldn't have to worry about re-acquiring the GVL when libxml2 or the C-extension calls back into Ruby, since TruffleRuby already runs Ruby without the GVL by default. The latest versions of nokogiri already run libxml2 with
Searching through the codebase, I'm seeing many usages of static variables in C files, though I'm not sure whether they are all just static read-only constants. Unfortunately I don't have the time, or really anything but the most rudimentary understanding of C, so I wouldn't be able to audit the codebase to see if this would work. I'm just hoping to start the discussion here (sorry!!) Obviously understandable if others don't have the time either, just leaving this here and crossing my fingers!
Oh also: note that in experiments I've run on my own codebase on TruffleRuby with the C-extension lock turned off, I didn't have any issues and everything worked as expected (see previous comment), so I am hoping there might not actually be many changes needed, if any!
-
In planning ahead for the near future when TruffleRuby can run C-extensions marked with rb_ext_ractor_safe(true) in parallel, and for when Ractors are no longer just experimental, it would be great if the C-extension could be made threadsafe, or marked as such if it already is!