Regex: remove line breaks from parts of string (PHP)

I want to remove all the line breaks and carriage returns from an XML file so all tags fit on one line each.

XML Source example:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
</resources>

My take at it:

$pattern = "#(\t\t<[^>]*>[^<>]*)[\r\n]+([^<>]*</.*>)#";
$replacement = "$1$2";
$data = preg_replace($pattern, $replacement, $data);

This pattern fixes the 2nd resource and puts its <desc> back on one line. However, it removes only one of the two line breaks in the 3rd resource. The result is this:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
</resources>

What's wrong with my pattern?

realvedgie asked Dec 19 '25 08:12


2 Answers

The first [^<>]* in your regex initially gobbles up all of the remaining text, and then has to backtrack a ways so the rest of the regex can match. It only backtracks as far as it has to, i.e., to the last line break in the text. The rest of the regex is able to match what's left, so that's that.
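The backtracking described above can be reproduced on a reduced sample. This is only a sketch: the two-line <desc> string is a hypothetical stand-in for the 3rd resource, and the pattern is the one from the question.

```php
<?php
// Hypothetical reduced sample: one <desc> element with two line breaks.
$data = "\t\t<desc>One.\nTwo.\n</desc>";

// Pattern and replacement copied from the question.
$pattern = "#(\t\t<[^>]*>[^<>]*)[\r\n]+([^<>]*</.*>)#";
$replacement = "$1$2";

// The greedy [^<>]* first swallows "One.\nTwo.\n", then gives back only the
// final "\n" so that [\r\n]+ has something to match. The earlier line break
// stays inside group 1 and is written back unchanged.
$result = preg_replace($pattern, $replacement, $data);

var_dump($result === "\t\t<desc>One.\nTwo.</desc>"); // bool(true)
```

The single match covers the whole element, so there is nothing left for preg_replace to scan afterwards.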

But your regex would only remove one line break per element in any case, because each match consumes the whole element and leaves nothing for a second match. It should consume only the part you want to remove. Check this out:

preg_replace('#[\r\n]+(?=[^<>]*</desc>)#', ' ', $data);

After the line break is found, the lookahead confirms that it was found inside a <desc> element. But the lookahead doesn't consume anything, so the next line break (if there is one) is still there to be matched on the next pass.
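A minimal sketch of that behaviour, assuming a shortened, hypothetical sample string rather than the full file from the question:

```php
<?php
// Hypothetical sample: two line breaks inside <desc>, one after it.
$data = "<desc>First sentence.\nSecond sentence.\n</desc>\n<id>003</id>";

// Only the line break itself is consumed. The lookahead checks that a
// </desc> follows before any other tag starts, without consuming text,
// so each line break inside the element is matched separately.
$result = preg_replace('#[\r\n]+(?=[^<>]*</desc>)#', ' ', $data);

var_dump($result === "<desc>First sentence. Second sentence. </desc>\n<id>003</id>"); // bool(true)
```

Both line breaks inside the element become spaces, which leaves a trailing space before </desc>. The line break after </desc> is kept because the next tag is not </desc>. Using '' as the replacement would avoid the extra space but would also join the two sentences with nothing between them.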

You can't have the lookahead match just any end tag (</\w+>) because that would let it match line breaks between elements as well as inside them. You can, however, enumerate the elements you want to work on:

</(?:desc|name|id)>
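Plugged into the lookahead, the alternation might look like the sketch below. The sample string is hypothetical, and the element list is the one named above.

```php
<?php
// Hypothetical sample: line breaks inside <name> and <desc>, and between elements.
$data = "<name>Resource\nname 3</name>\n<desc>One.\nTwo.\n</desc>\n";

// The lookahead now accepts the closing tag of any listed element.
$pattern = '#[\r\n]+(?=[^<>]*</(?:desc|name|id)>)#';
$result = preg_replace($pattern, ' ', $data);

var_dump($result === "<name>Resource name 3</name>\n<desc>One. Two. </desc>\n"); // bool(true)
```

The line breaks after </name> and </desc> survive, because the next thing after them is either an opening tag or the end of the input, not one of the listed closing tags.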
Alan Moore answered Dec 20 '25 22:12


Unless there's a lot more to what you're trying to do than you describe, I think you're making it way too complicated. You don't need nearly as complex a regex as you have. Try just using /\r?\n/. This worked for me with your data:

$data = preg_replace("/\r?\n/", "", $data);
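For reference, a small sketch of what this replacement does on a reduced, hypothetical sample with mixed \r\n and \n endings:

```php
<?php
// Hypothetical sample with both Windows and Unix line endings.
$data = "<resource>\r\n  <id>003</id>\n  <desc>One.\nTwo.\n</desc>\n</resource>\n";

// Strips every line break, whether inside an element or between elements.
$result = preg_replace("/\r?\n/", "", $data);

var_dump($result === "<resource>  <id>003</id>  <desc>One.Two.</desc></resource>"); // bool(true)
```

Because nothing restricts where the match may occur, the whole input ends up on a single line, and text on either side of a removed break is joined without a space.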
Jeffrey Blake answered Dec 20 '25 22:12