 

How to turn tabs into blockquotes using perl regex

Tags:

regex

perl

If I have HTML lines like the following (\t means a tab character):

<P>\tSome text</P>
<P>\t\tSome text</P>
<P>\tSome text</P>

Using regex, how can I convert the above to:

<P><BLOCKQUOTE>Some text</BLOCKQUOTE></P>
<P><BLOCKQUOTE><BLOCKQUOTE>Some text</BLOCKQUOTE></BLOCKQUOTE></P>
<P><BLOCKQUOTE>Some text</BLOCKQUOTE></P>

At the moment I have:

for $line (@lines)
{
   $line =~ s{^(<P>(?:<BLOCKQUOTE>)*)\t(.+?)((?:</BLOCKQUOTE>)*</P>)$}{$1<BLOCKQUOTE>$2</BLOCKQUOTE>$3}g;
}
CJ7 asked Dec 07 '25 14:12


1 Answer

The tricky bit here is to insert as many replacement tags as there are tabs.

I'd go with multiple passes: first count the tabs, then go over the string again and replace them with the right number of opening and closing replacement tags (BLOCKQUOTE). A single regex for this is bound to be much more complex, and thus that much harder to tweak and maintain.

use warnings;
use strict;
use feature 'say';

my @test_strings = ( 
    qq(<p>\t\ttwo tabs</p>),
    qq(<p>\tone tab</p>),
    qq(<p>no tab</p>),
    qq(<div>\tnot paragraph</div>),
);

say for @test_strings;  say '';

for (@test_strings) 
{
    if (my ($tabs) = /<p>(\t+)/)          # capture consecutive tabs
    { 
        my $nt = () = $tabs =~ /\t/g;     # count them

        my $ot = "<BLOCKQUOTE>"  x $nt;   # open-tag
        my $ct = "</BLOCKQUOTE>" x $nt;   # close-tag

        s{<p> \t+ ([^\t].*?) </p>}{<p>$ot$1$ct</p>}x;   # replace leading tabs only
    }

    say;    # print every string, changed or not
}

Prints

<p>             two tabs</p>
<p>     one tab</p>
<p>no tab</p>
<div>   not paragraph</div>

<p><BLOCKQUOTE><BLOCKQUOTE>two tabs</BLOCKQUOTE></BLOCKQUOTE></p>
<p><BLOCKQUOTE>one tab</BLOCKQUOTE></p>
<p>no tab</p>
<div>   not paragraph</div>
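
The counting line in the code above can be checked on its own. The following is a minimal sketch with made-up sample strings; the tr/\t// variant is an addition here, not something the answer uses.

```perl
use strict;
use warnings;

my $tabs  = "\t\t\t";    # a run of tabs only, like the capture from (\t+)
my $mixed = "\ta\tb";    # tabs mixed with other characters (made-up sample)

my $by_regex  = () = $tabs =~ /\t/g;    # list assignment in scalar context gives the match count
my $by_length = length $tabs;           # valid only because $tabs holds nothing but tabs
my $by_tr     = ($mixed =~ tr/\t//);    # tr counts the tabs and leaves the string unchanged

print "$by_regex $by_length $by_tr\n";  # prints "3 3 2"
```

Both the regex and tr forms count tabs in a string that also holds other characters, while length is only correct for a tabs-only string.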

Notes

  • As it stands, this handles at most one paragraph (<p>...</p>) per string. Using

    while (my ($tabs) = /<p>(\t+)/g) { ... }

    instead of if (...) appears to work with multiple paragraphs, but this needs more testing.

  • The counting itself uses the =()= "operator". It imposes list context on its right-hand side, so the regex returns the list of all matches. That list assignment is in turn assigned to the scalar on the left, which yields the number of matches.

    In this case, with $tabs consisting of only the tab characters, one can simply do

     my $nt = split '', $tabs;
    

    (Update: really just my $nt = length $tabs;, as in other answers)

    I still use the regex since it also works for a string that contains things other than tabs.

  • The code replaces only the consecutive tabs at the beginning, right after <p>, and not any that may come later in the string (which is how I read the requirement).

    This is done by following the tabs in the pattern (\t+) with a single non-tab character and then any characters, [^\t].*?. The pattern thus still matches a string that has more tabs further down, but only the leading "block" of tabs is replaced.
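
The notes above (several paragraphs per string, counting with length, replacing only the leading tabs) can be combined by doing the count inside the replacement part with /e and /g. This is a minimal sketch, not the code from this answer: the sub name tabs_to_blockquotes is hypothetical, and the /i flag is an assumption added so that the upper-case <P> from the question matches as well.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the text of each paragraph in as many
# BLOCKQUOTE pairs as the paragraph has leading tabs.
sub tabs_to_blockquotes {
    my ($html) = @_;
    $html =~ s{(<p>)(\t+)([^\t].*?)(</p>)}{
        my $n = length $2;    # number of leading tabs in this paragraph
        $1 . ('<BLOCKQUOTE>' x $n) . $3 . ('</BLOCKQUOTE>' x $n) . $4
    }gie;                     # /g: every paragraph, /i: assumption, /e: run the code above
    return $html;
}

print tabs_to_blockquotes("<P>\t\ttwo tabs</P><P>\tone tab</P><div>\tnot paragraph</div>"), "\n";
```

The count still happens before the tags are built, only now once per matched paragraph. Tabs after the first non-tab character are left alone, and so is anything outside a paragraph.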

zdim answered Dec 09 '25 16:12


