[core] Skip stats from schema in DataEvolution when column type has changed #7803

Merged
JingsongLi merged 1 commit into apache:master from ArnavBalyan:arnavb/fix-stats
May 11, 2026

@ArnavBalyan
Member

Purpose

  • Stats merging currently ignores type changes during schema evolution. After an ALTER TABLE changes a column type (e.g. INT to STRING), stats from older files written under the prior schema are merged in under the new type.
  • When the types mismatch, garbage stats accumulate, which causes wrong pruning and silent data loss.
  • Detect this mismatch and skip the stats in such cases.
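
To illustrate how stale stats can cause wrong pruning, here is a minimal Python sketch. It is not Paimon code: `naive_may_contain` is a hypothetical helper, and carrying the old min/max over by casting them to strings is an assumed stand-in for whatever reinterpretation happens in the real merge path.

```python
def naive_may_contain(min_value, max_value, value):
    """Min/max pruning check for an equality predicate (hypothetical helper)."""
    return min_value <= value <= max_value


# An old file was written while the column was INT. It holds 5..100, including 42.
old_min, old_max = 5, 100

# Under the original INT type the check is correct: the file must be read.
read_as_int = naive_may_contain(old_min, old_max, 42)

# Assumption: after INT -> STRING, the old stats are carried over as strings.
str_min, str_max = str(old_min), str(old_max)  # "5" and "100"

# String comparison is lexicographic, so "5" <= "42" is False and the
# file is pruned even though it contains the row with value 42.
read_as_string = naive_may_contain(str_min, str_max, "42")
pruned_wrongly = read_as_int and not read_as_string
```

The same min/max pair gives a correct answer under the old type and silently drops the file under the new one, which is the failure mode described above.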

Tests

  • Unit test (UT)

Contributor

@leaves12138 leaves12138 left a comment

LGTM. The change is conservative and correct for data-evolution stats pruning: if a file contains the same field id but its stats were written with a different physical type than the current table field, using those min/max bytes with the current predicate type can produce invalid comparisons and wrong pruning. Skipping that file's stats for the field makes the predicate evaluation fall back to unknown stats instead of dropping data incorrectly. The added regression test covers the mixed old/new type case.
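
The fallback described in this comment could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Java change in this PR: `FieldStats`, `usable_stats`, and `may_contain` are hypothetical names, and types are compared as plain strings for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldStats:
    """Per-file stats for one field (hypothetical structure)."""
    type_name: str      # type the stats were written with
    min_value: object
    max_value: object


def usable_stats(stats: Optional[FieldStats], current_type: str) -> Optional[FieldStats]:
    """Return the stats only if they were written with the current field type."""
    if stats is None or stats.type_name != current_type:
        return None  # treat mismatched stats as unknown
    return stats


def may_contain(stats: Optional[FieldStats], current_type: str, value) -> bool:
    """Equality-predicate check: True means the file must be read."""
    usable = usable_stats(stats, current_type)
    if usable is None:
        return True  # unknown stats: never prune
    return usable.min_value <= value <= usable.max_value


old_file = FieldStats("INT", 5, 100)        # written before the type change
new_file = FieldStats("STRING", "a", "m")   # written after the type change

keep_old = may_contain(old_file, "STRING", "42")   # stats skipped, file kept
keep_new = may_contain(new_file, "STRING", "c")    # within ["a", "m"], file kept
prune_new = may_contain(new_file, "STRING", "z")   # outside ["a", "m"], file pruned
```

Skipping is conservative by design: a file with mismatched stats is always read, so the cost is some lost pruning on older files rather than dropped rows.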

@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit da99cb2 into apache:master May 11, 2026
12 checks passed
@ArnavBalyan
Member Author

Thanks @leaves12138 @JingsongLi!
