C++: Remove all HTML formatting from string?

Question

I have a string that might include <br> or <span>...</span> tags, or other HTML characters/entities. I want a robust way of stripping all that and getting the remaining UTF-8 characters. This should be cross-platform, ideally.

Something like this would be ideal:

http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/

but that also removes the tags.

Peter Ruderman · Accepted Answer

Just how stringent are your requirements? A simple two-state FSA ought to do. Start in the READCHAR state. Whenever you read a '<' in that state, transition to the READTAG state; otherwise, write the character to your result string. Whenever you're in the READTAG state and read a '>', transition back to the READCHAR state.
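
A minimal sketch of that two-state machine, assuming the input is a UTF-8 std::string and that every '<' opens a tag (so a stray '<' in text, comments, and script bodies are not handled specially). The function name strip_tags and the State enum are illustrative, not from any library:

```cpp
#include <string>

// Two-state scanner: copies bytes outside <...> and drops everything inside.
std::string strip_tags(const std::string& input) {
    enum class State { ReadChar, ReadTag };
    State state = State::ReadChar;
    std::string result;
    result.reserve(input.size());

    for (char c : input) {
        switch (state) {
        case State::ReadChar:
            if (c == '<') {
                state = State::ReadTag;   // start of a tag: stop copying
            } else {
                result += c;              // ordinary byte, UTF-8 multi-byte sequences included
            }
            break;
        case State::ReadTag:
            if (c == '>') {
                state = State::ReadChar;  // end of the tag: resume copying
            }
            break;
        }
    }
    return result;
}
```

For example, strip_tags("Hello<br><span class=\"x\">world</span>") gives "Helloworld". Working byte by byte is safe for UTF-8 here because '<' and '>' never occur inside a multi-byte sequence.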

Edit: Oops. Missed the part about entities. You'll need a READENTITY state for that too. When you transition out of it, you could also convert the code into the corresponding UTF-8 character.
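
Here is a rough sketch of the entity part on its own, assuming UTF-8 output and only a handful of named entities (the table is deliberately tiny; the full HTML list is much longer). All names here (decode_entities, append_entity, parse_code_point, append_utf8) are hypothetical helpers. Anything that does not turn out to be a recognised entity is copied through unchanged:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Append a Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Parse 1 to 8 digits in the given base (10 or 16); false on any other character.
bool parse_code_point(const std::string& digits, int base, unsigned long& cp) {
    if (digits.empty() || digits.size() > 8) return false;
    cp = 0;
    for (char ch : digits) {
        int v;
        if (ch >= '0' && ch <= '9') v = ch - '0';
        else if (ch >= 'a' && ch <= 'f') v = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F') v = ch - 'A' + 10;
        else return false;
        if (v >= base) return false;
        cp = cp * base + v;
    }
    return true;
}

// Resolve the text between '&' and ';' and append it as UTF-8; false if unknown.
bool append_entity(std::string& out, const std::string& name) {
    // Assumption: a small illustrative subset of the named entities.
    static const std::unordered_map<std::string, unsigned long> named = {
        {"amp", 0x26}, {"lt", 0x3C}, {"gt", 0x3E},
        {"quot", 0x22}, {"apos", 0x27}, {"nbsp", 0xA0}};
    unsigned long cp = 0;
    if (!name.empty() && name[0] == '#') {
        bool hex = name.size() > 1 && (name[1] == 'x' || name[1] == 'X');
        if (!parse_code_point(name.substr(hex ? 2 : 1), hex ? 16 : 10, cp)) return false;
    } else {
        auto it = named.find(name);
        if (it == named.end()) return false;
        cp = it->second;
    }
    // Reject NUL, out-of-range values, and UTF-16 surrogates.
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    append_utf8(out, cp);
    return true;
}

std::string decode_entities(const std::string& input) {
    enum class State { ReadChar, ReadEntity };
    State state = State::ReadChar;
    std::string result, entity;
    for (std::size_t i = 0; i < input.size();) {
        char c = input[i];
        if (state == State::ReadChar) {
            if (c == '&') { state = State::ReadEntity; entity.clear(); }
            else result += c;
            ++i;
        } else if (c == ';') {
            // Leaving READENTITY: convert, or keep the original text if unknown.
            if (!append_entity(result, entity)) result += '&' + entity + ';';
            state = State::ReadChar;
            ++i;
        } else if (std::isalnum(static_cast<unsigned char>(c)) || c == '#') {
            entity += c;
            ++i;
        } else {
            result += '&' + entity;   // not an entity after all
            state = State::ReadChar;  // reprocess c as ordinary input (no ++i)
        }
    }
    if (state == State::ReadEntity) result += '&' + entity;  // unterminated at end of input
    return result;
}
```

In a combined machine, the READCHAR branch would also check for '<' and switch to READTAG. Running tag stripping first and entity decoding second gives the same result, and that order keeps a decoded "&lt;" from being mistaken for the start of a tag.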

Tags: c++, c, html, decode
