Skip to content

feat(add,import): resolve --data-set-metadata to existing dataset IDs#438

Merged
SgtPooki merged 7 commits intomasterfrom
fix/435-data-set-metadata-subset-resolver
May 4, 2026
Merged

feat(add,import): resolve --data-set-metadata to existing dataset IDs#438
SgtPooki merged 7 commits intomasterfrom
fix/435-data-set-metadata-subset-resolver

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

@SgtPooki SgtPooki commented May 1, 2026

Closes #435.

What changed

--data-set-metadata previously had no effect on dataset selection because synapse-sdk's metadataMatches requires exact key/value equality. Migration datasets carry 4 keys (source, space-did, space-name, withIPFSIndexing); a request for source=storacha-migration alone failed to match and the SDK created a new dataset.

Both add and import now resolve --data-set-metadata locally before invoking the upload pipeline:

  • Match definition: requested keys are a subset of an existing dataset's metadata.
  • Match count vs --copies (default 2):
    • == copies → route to those dataset IDs, drop metadata from the upload request.
    • 0 matches → pass metadata through unchanged (current behavior creates a new dataset).
    • other → error with matched IDs and the expected count.

Resolver runs only when the caller has not pinned --data-set-ids or --provider-ids. Default flows (no --data-set-metadata) are unchanged.

How to verify

pnpm test

Live verification on calibration with migration datasets 13260 + 13261:

filecoin-pin add migration.json --data-set-metadata source=storacha-migration
# → ✓ Matched existing data sets 13260, 13261 via metadata filter
# → piece appended to both, no new dataset created

Negative path:

filecoin-pin add file.json --data-set-metadata source=does-not-exist
# → • No existing data sets matched --data-set-metadata; SDK will create a new data set with the requested metadata

Closes #435

synapse-sdk's metadataMatches requires exact key/value equality, so
passing --data-set-metadata source=storacha-migration against datasets
that also carry space-did/space-name fails to match and the SDK creates
a new dataset.

Resolve --data-set-metadata locally before invoking the upload pipeline
when the caller has not pinned dataSetIds/providerIds:

- Match: requested keys are a subset of an existing dataset's metadata.
- Outcome by match count vs --copies (default 2):
  - == copies: route to those dataset IDs, drop metadata from the
    upload request to avoid SDK exact-match interference.
  - 0 matches: pass metadata through unchanged (SDK creates a new
    dataset tagged with the requested metadata).
  - other: error with the matched IDs and expected count, suggest
    --copies <n> or --data-set-ids.

Verified against migration datasets 13260 + 13261 on calibration:
`add ... --data-set-metadata source=storacha-migration` reuses both
without creating new datasets.

Default `add` and `import` flows (no --data-set-metadata) are unchanged.
Copilot AI review requested due to automatic review settings May 1, 2026 13:53
@FilOzzy FilOzzy added team/filecoin-pin "Filecoin Pin" project is a stakeholder for this work. team/fs-wg FOC working group is a stakeholder for this work, and thus wants to track it on their project board. labels May 1, 2026
@FilOzzy FilOzzy added this to FOC May 1, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 1, 2026
@SgtPooki SgtPooki self-assigned this May 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds local subset-based resolution for --data-set-metadata so add/import can route uploads into existing datasets (instead of creating new ones) when the requested metadata keys are a subset of an existing dataset’s metadata.

Changes:

  • Introduces resolveDataSetIdsByMetadata to resolve dataset IDs via local subset matching against the caller’s datasets.
  • Updates filecoin-pin add and filecoin-pin import flows to use resolved dataSetIds (and stop forwarding user metadata) when match count equals --copies.
  • Adds unit tests for the resolver and updates add/import unit test mocks to support dataset listing.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/core/data-set/resolve-by-metadata.ts New resolver to subset-match requested metadata to existing live datasets and return IDs / ambiguity.
src/core/data-set/index.ts Exports the new resolver from the core data-set module.
src/add/add.ts Adds pre-upload resolution step for --data-set-metadata to dataSetIds when not pinned by IDs.
src/import/import.ts Same resolution behavior as add, applied to CAR import.
src/test/unit/resolve-by-metadata.test.ts Adds unit tests for resolver outcomes (no-match/matched/ambiguous).
src/test/unit/add.test.ts Updates Synapse mock to include getClientAddress and findDataSets for resolver path.
src/test/unit/import.test.ts Updates Synapse mock to include getClientAddress and findDataSets for resolver path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/import/import.ts
Comment thread src/import/import.ts
Comment thread src/core/data-set/resolve-by-metadata.ts Outdated
Comment thread src/test/unit/resolve-by-metadata.test.ts
Comment thread src/add/add.ts
Comment thread src/add/add.ts
Copy link
Copy Markdown
Collaborator Author

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

- Resolver: require key presence on dataset metadata (not value-only
  comparison). Previous `(metadata?.[key] ?? '') === value` falsely
  matched datasets missing the requested key when the requested value
  was empty string.
- Upload layer (src/core/upload/synapse.ts): suppress filecoin-pin's
  default metadata injection when the caller passes dataSetIds or
  pre-resolved contexts. SDK doesn't consult metadata on the
  dataset-id path today; suppressing here keeps intent honest and
  prevents future SDK changes from silently mismatching.
- Tests: rewire resolver tests so the mock applies the resolver's
  filter callback to raw fixtures, exercising the actual subset-match
  logic. Without this, the matching code never ran in tests.
- Tests: add integration coverage in add.test.ts and import.test.ts
  for the matched (drops metadata, sets dataSetIds) and ambiguous
  (throws) resolution paths.
@BigLep BigLep moved this from 📌 Triage to 🔎 Awaiting review in FOC May 1, 2026
Comment thread src/core/data-set/resolve-by-metadata.ts Outdated
Copy link
Copy Markdown
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this'll do for a start; I just have issue with the use of "ambiguous" here, both its inapplication and its lossy nature; I'll leave that one with you to ponder though, near enough is good enough for now

@github-project-automation github-project-automation Bot moved this from 🔎 Awaiting review to ✔️ Approved by reviewer in FOC May 4, 2026
@SgtPooki SgtPooki merged commit 495a734 into master May 4, 2026
15 checks passed
@SgtPooki SgtPooki deleted the fix/435-data-set-metadata-subset-resolver branch May 4, 2026 18:30
@github-project-automation github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team/filecoin-pin "Filecoin Pin" project is a stakeholder for this work. team/fs-wg FOC working group is a stakeholder for this work, and thus wants to track it on their project board.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resolve --data-set-metadata to existing datasets via local subset match

5 participants