Optimization of REXML::XPathParser#step#310
Conversation
| new_nodes = {} | ||
| nodeset.each do |node| | ||
| new_nodeset = [] | ||
| new_nodes = {} |
There was a problem hiding this comment.
With this change, the result of this xpath has changed.
REXML::XPath.match(REXML::Document.new('<a id="a1"><b id="b1"><a id="a2"><b id="b2"><b id="b3"></b></b></a></b></a>'), '//a/descendant::b[1]')
# => [<b id='b1'> ... </>, <b id='b2'> ... </>] Expected (in master branch)
# => [<b id='b1'> ... </>] Actual (this pull request)Reference:
document.body.innerHTML='<a id="a1"><b id="b1"><a id="a2"><b id="b2"><b id="b3"></b></b></a></b></a>'
snapshot = document.evaluate('//a/descendant::b[1]', document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
[...new Array(snapshot.snapshotLength)].map((_, i) => snapshot.snapshotItem(i))
// => [b#b1, b#b2]There was a problem hiding this comment.
I'm sorry.
Since this change has unintended consequences, I will cancel the change to #descendant.
Thanks!
| next if seen.key?(raw_node.object_id) | ||
| seen[raw_node.object_id] = true |
There was a problem hiding this comment.
object_id will lazily assign a new id when it is called. We can avoid this by using seen = {}.compare_by_identity instead.
There was a problem hiding this comment.
Thanks for pointing that out.
I’ll fix it.
There was a problem hiding this comment.
Using compare_by_identity, Hash will compare by its identity, (same as comparing by object_id, but without assigning actual object_id), so seen.key?(raw_node) and seen[raw_node] = true is enough.
These two code gives the same result
h = {}
h[key.object_id] = value
h[key.object_id] #=> value
h[key.dup.object_id] #=> nilh = {}.compare_by_identity
h[key] = value
h[key] #=> value
h[key.dup] #=> nil5b3e01a to
2d3346f
Compare
## Benchmark
```
$ benchmark-driver benchmark/xpath.yaml
before after before(YJIT) after(YJIT)
REXML::XPath.match(REXML::Document.new(xml), 'a//a') 4.170k 4.300k 4.091k 4.087k i/s - 100.000 times in 0.023979s 0.023257s 0.024444s 0.024470s
REXML::XPath.match(REXML::Document.new(xml), '//a//a') 196.250 1.378k 293.984 1.701k i/s - 100.000 times in 0.509554s 0.072574s 0.340154s 0.058796s
Comparison:
REXML::XPath.match(REXML::Document.new(xml), 'a//a')
after: 4299.8 i/s
before: 4170.3 i/s - 1.03x slower
before(YJIT): 4091.0 i/s - 1.05x slower
after(YJIT): 4086.6 i/s - 1.05x slower
REXML::XPath.match(REXML::Document.new(xml), '//a//a')
after(YJIT): 1700.8 i/s
after: 1377.9 i/s - 1.23x slower
before(YJIT): 294.0 i/s - 5.79x slower
before: 196.3 i/s - 8.67x slower
```
- YJIT=ON : 0.99x - 5.79x faster
- YJIT=OFF : 1.03x - 7.01x faster
2d3346f to
6456800
Compare
There was a problem hiding this comment.
Pull request overview
This PR optimizes REXML::XPathParser#step by deduplicating nodes earlier when merging multiple node sets, improving performance (notably for queries like //a//a) while preserving correct XPath node-set semantics (no duplicates, document order) and predicate behavior.
Changes:
- Deduplicate merged node sets in
XPathParser#stepusing identity-based tracking before sorting and assigning positions. - Add regression tests covering descendant-axis ordering, duplicate elimination, and position predicates across overlapping subtrees.
- Update the XPath benchmark to include a
//a//ascenario and reduce XML nesting depth.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/rexml/xpath_parser.rb | Deduplicates merged raw nodes (by identity) before sorting and wrapping into XPathNodes. |
| test/xpath/test_base.rb | Adds regression tests for descendant-axis behavior (no duplicates, document order, predicate position semantics). |
| benchmark/xpath.yaml | Adds a new benchmark case for //a//a and reduces generated XML depth. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Benchmark