Skip XML comments and empty sequences in TrainingData iteration by c-tonneslan · Pull Request #56 · datamade/parserator

c-tonneslan · 2026-05-17T17:34:36Z

TrainingData.__iter__ used to hand XML comment nodes and empty <TokenSequence/> elements straight to trainModel, which then crashed when it tried to iterate over the (non-existent) tokens. Skip both during iteration.

Concretely:

Comments are useful for organizing labeled examples in a single file. Ignoring them lets you keep grouping notes inline without breaking training.
Empty sequences (whether intentional or just an editing slip) shouldn't take the whole training run down.
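The two skip cases can be sketched roughly as follows. This is not the PR's actual diff: `iter_sequences` and `SAMPLE` are hypothetical names, and the sketch uses the standard library's `xml.etree.ElementTree` (with comments explicitly preserved) rather than the project's own parsing code, so treat it as an illustration of the idea only.

```python
import xml.etree.ElementTree as ET

# Hypothetical labeled-data file: one comment, one empty sequence,
# and two real sequences.
SAMPLE = """<Collection>
  <!-- street addresses -->
  <TokenSequence><Number>123</Number> <Street>Main</Street></TokenSequence>
  <TokenSequence/>
  <TokenSequence><Number>456</Number> <Street>Oak</Street></TokenSequence>
</Collection>"""


def iter_sequences(xml_text):
    """Yield one list of (token_text, label) pairs per non-empty sequence."""
    # The stdlib parser drops comments by default; keep them here so the
    # skip logic below has something to skip (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)

    for node in root:
        # Comment nodes carry a non-string tag, so they are not sequences.
        if not isinstance(node.tag, str):
            continue
        # An empty <TokenSequence/> has no child tokens to train on.
        if len(node) == 0:
            continue
        yield [(token.text, token.tag) for token in node]


sequences = list(iter_sequences(SAMPLE))
```

With the sample above, only the two populated sequences come out of the generator; the comment and the empty element are dropped silently instead of reaching the training step.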

The fix matches the snippet the issue author suggested. Added two unit tests in tests/test_xml.py (one for each skip case).

Closes #54.

The XML parser used to feed comments and empty <TokenSequence/> nodes straight into trainModel, which then crashed when it tried to iterate over the tokens. Skip both during __iter__ so callers can keep using comments for grouping examples (and unintentional empty sequences don't take the whole training run down with them).

Closes datamade#54.

Signed-off-by: Charlie Tonneslan <cst0520@gmail.com>

c-tonneslan wants to merge 1 commit into datamade:main from c-tonneslan:fix/skip-comments-and-empty-sequences
