Skip XML comments and empty sequences in TrainingData iteration #56

Open
c-tonneslan wants to merge 1 commit into datamade:main from c-tonneslan:fix/skip-comments-and-empty-sequences

@c-tonneslan

TrainingData.__iter__ used to hand XML comment nodes and empty <TokenSequence/> elements straight to trainModel, which then crashed when it tried to iterate over the (non-existent) tokens. Skip both during iteration.

Concretely:

  • Comments are useful for organizing labeled examples in a single file. Ignoring them lets you keep grouping notes inline without breaking training.
  • Empty sequences (whether intentional or just an editing slip) shouldn't take the whole training run down.

The fix matches the snippet the issue author suggested. Added two unit tests in tests/test_xml.py (one for each skip case).
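
For readers who want to see the shape of the skip logic, here is a minimal sketch. It is not the project's actual code: `iter_sequences`, the `SAMPLE` document, and the tag names inside it are hypothetical, and the sketch uses the standard library's `xml.etree.ElementTree` (Python 3.8+, for `insert_comments`) so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical training file: one comment, one labeled sequence, one empty sequence.
SAMPLE = """<Collection>
  <!-- group: street addresses -->
  <TokenSequence><Number>123</Number> <Street>Main</Street></TokenSequence>
  <TokenSequence/>
</Collection>"""


def iter_sequences(xml_text):
    """Yield one list of (token_text, label) pairs per non-empty sequence."""
    # The stdlib parser drops comments by default; keep them here so the
    # comment check below is actually exercised.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)

    for sequence in root:
        # Comment nodes carry the ET.Comment factory as their tag.
        if sequence.tag is ET.Comment:
            continue
        # An element with no children has no tokens to train on.
        if len(sequence) == 0:
            continue
        yield [(token.text, token.tag) for token in sequence]


sequences = list(iter_sequences(SAMPLE))
```

In this sketch only the populated sequence reaches the caller, so a downstream training function never sees a comment node or an empty token list.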

Closes #54.

The XML parser used to feed comments and empty <TokenSequence/> nodes
straight into trainModel, which then crashed when it tried to iterate
over the tokens. Skip both during __iter__ so callers can keep using
comments for grouping examples (and unintentional empty sequences
don't take the whole training run down with them).

Closes datamade#54.

Signed-off-by: Charlie Tonneslan <cst0520@gmail.com>

Development

Successfully merging this pull request may close these issues.

XML Parsing Does not Handle Comments or Empty Elements
