
How to extract title from string containing mix of special characters and words in R

I have a long string containing a mix of words and characters.

<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>

I need to extract only the title:

Improving Discriminator-Generator Balance in Generative Adversarial Networks

I know R has the ability to extract words between 2 characters, such as:

sub(">.*<", "", my_string)

But this obviously won't work here, as the string contains a mix of many different characters.

Cybernetic asked Dec 13 '25 06:12


1 Answer

You should probably be using an HTML parser here. That being said, the following one-liner with gsub might work:

gsub(".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*", "\\1", input)

I say might because I make many assumptions, including that the title anchor tag is the first one, and that you don't have nested content. In practice, you can try using an HTML/XML parser for greater control.
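For the parser route, here is a minimal sketch assuming the xml2 package is installed (it is not part of base R). The variable names and the XPath expression are my own choices for illustration, and the XPath assumes the title link is the anchor inside the h4 that does not carry the pdf-link class, as in the sample string from the question:

```r
# Sketch only: requires install.packages("xml2")
library(xml2)

# Sample string from the question
input <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

# Parse the fragment; read_html() accepts a literal HTML string
doc <- read_html(input)

# Pick the first <a> under <h4> that is not the PDF download link
# (assumption: the title link never has the "pdf-link" class)
title_node <- xml_find_first(doc, "//h4/a[not(contains(@class, 'pdf-link'))]")

# Extract the text and trim the surrounding whitespace
title <- xml_text(title_node, trim = TRUE)
title
```

Selecting by structure rather than by position means the result should not change if the two links are reordered or if extra attributes are added to the tags, which is where the regex is most likely to break.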


Tim Biegeleisen answered Dec 15 '25 04:12


