Fix #405: no spurious space between emphasis and following punctuation #442

Open
assinscreedFC wants to merge 1 commit into Alir3z4:master from assinscreedFC:fix/405-emphasis-punctuation-space

@assinscreedFC

Summary

Fixes #405.

After a closing emphasis marker, html2text inserted a separating space before any character other than whitespace, brackets, and the punctuation marks . ! and ?. As a result, it wrongly added a space before other punctuation:

>>> import html2text
>>> html2text.html2text("<em>hello</em>,")
'_hello_ ,\n\n'      # expected '_hello_,\n\n'

The same happens before : " ; and other punctuation.

Change

In handle_data, the separating space after stressed (emphasis) text is only needed before a word character, which would otherwise attach to the closing _/* marker and stop Markdown from recognising the emphasis. Punctuation never merges with the marker, so it must not get a space. The condition changes from a broad exclusion list (a negated character class):

re.match(r"[^][(){}\s.!?]", data[0])

to:

re.match(r"\w", data[0])

\w keeps the needed space before letters, digits and _ (all of which break a closing _), while dropping it before punctuation.
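
To make the difference concrete, here is a small standalone sketch that applies both patterns to the first character of the text following an emphasis marker. It only exercises the two regular expressions quoted above; `needs_space`, `OLD`, `NEW` and the sample characters are hypothetical names chosen for illustration and are not part of html2text.

```python
import re

# The two patterns quoted in this PR, compiled in isolation.
OLD = re.compile(r"[^][(){}\s.!?]")  # previous condition: negated character class
NEW = re.compile(r"\w")              # new condition: word characters only


def needs_space(pattern: re.Pattern, data: str) -> bool:
    """Hypothetical helper: True if `pattern` would request a separating
    space before `data`, judged by its first character only."""
    return bool(pattern.match(data[0]))


if __name__ == "__main__":
    # Illustrative sample of characters that might follow a closing marker.
    for ch in [",", ":", ";", '"', "'", ".", "!", ")", " ", "a", "1", "_"]:
        print(f"{ch!r:6} old={needs_space(OLD, ch)!s:5} new={needs_space(NEW, ch)}")
```

With the old pattern, characters such as the comma, colon, semicolon and quotes request a space; with the new one, only letters, digits and _ do. Note that in Python 3, \w on str patterns also matches non-ASCII letters and digits.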

This is scoped to the punctuation case (#405). The separate **strong** + alphanumeric case (#413) uses a different code path and is left untouched.

Tests

Added a regression fixture test/emphasis_punctuation.html / .md (the test driver auto-discovers *.html/*.md pairs). It pins both behaviours: no space before punctuation/apostrophe, space kept before a following word or digit.

All 199 tests pass; black, isort, mypy, and flake8 are clean. ChangeLog and AUTHORS updated.


AI assistance (Claude) was used; I reviewed every line and ran the tests.

…ctuation

After a closing emphasis marker html2text inserted a separating space
before anything except whitespace, brackets and `.!?`. That wrongly added
a space before other punctuation, e.g. `<em>hello</em>,` produced
`_hello_ ,` instead of `_hello_,`.

The separating space is only needed before a word character, which would
otherwise attach to the closing marker and stop Markdown from recognising
the emphasis. The condition is now `re.match(r"\w", data[0])`.

Adds a regression fixture (test/emphasis_punctuation.*) and a ChangeLog
entry.

Co-authored-by: Claude <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Extra space after a closing emphasis mark