How to extract title from string containing mix of special characters and words in R

Question

I have a long string containing a mix of words and characters.

<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>

I need to extract only the title:

Improving Discriminator-Generator Balance in Generative Adversarial Networks

I know R has the ability to extract words between 2 characters, such as:

sub(">.*<", "", my_string)

But this obviously won't work here, as there is a mix of many characters.

Tim Biegeleisen · Accepted Answer

You should probably be using an HTML parser here. That being said, the following one-liner with gsub might work:

gsub(".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*", "\\1", input)

I say "might" because the pattern makes several assumptions, including that the title anchor tag is the first one in the string and that it has no nested content. In practice, you can try using an HTML/XML parser for greater control.
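
As a rough sketch of the parser route, the snippet below uses the xml2 package, which is one possible choice and not something the question specified. It assumes xml2 is installed, that the string is stored in a variable named input, and that the title is the first <a> element inside the <h4>.

```r
# Assumption: the xml2 package is available (install.packages("xml2")).
library(xml2)

# The sample string from the question.
input <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

# Parse the fragment as HTML; the parser wraps it in <html><body>.
doc <- read_html(input)

# Select the first <a> directly under the <h4> (assumed to hold the title).
title_node <- xml_find_first(doc, "//h4/a")

# Extract its text and trim the surrounding whitespace.
title <- xml_text(title_node, trim = TRUE)
print(title)
```

If the title link is not always first, a more specific XPath such as "//h4/a[not(@class='pdf-link')]" may be more robust, though that also depends on the markup staying as shown.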

Tags: string, regex, r, extract

Cybernetic
