Capture a line containing a "." or ":" but not ending with a period

I'm trying to create a regex with a character set that may contain a period or colon, but the match must not end with a period. So I want to match a line saying "Lorem./: Ipsom dolor sit" but not "Lorem ipsum dolor sit."

This is my current regex. It doesn't work, because it still matches when the string ends in a period or colon:

/(\n{2,})([ \wåäöÅÄÖ,()%+\-:.]{2,75}[^.:])(\n{1,})/

I'm looking for headings in a huge, badly formatted plain-text file. The general pattern in this file is that a heading is always preceded by two or more newlines and always followed by one or more newlines. A heading sometimes ends in a : but never in a ., although it may contain a . or : elsewhere. Headings are always 2-75 characters long and are never preceded by another heading.

Any help would be greatly appreciated.

Edit: I realised that my explanation was quite bad and partly wrong, so I have updated this post.

Hultner asked Dec 08 '25 12:12


1 Answer

In general, if you want to match a string not ending in a dot, just add (?<!\.)$ to the end of the regex.

This is a negative lookbehind assertion.
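A minimal PHP sketch of the lookbehind in isolation. The pattern /^.+(?<!\.)$/ is a stripped-down illustration, not the final regex, and the two sample strings are the ones from the question:

```php
<?php
// Illustration only: a stripped-down pattern, not the final heading regex.
// (?<!\.) asserts that the character just before the end of the string is not a dot.
$pattern = '/^.+(?<!\.)$/';

$samples = [
    'Lorem./: Ipsom dolor sit',   // contains . and : but does not end in a dot
    'Lorem ipsum dolor sit.',     // ends in a dot
];

foreach ($samples as $line) {
    // preg_match() returns 1 on a match and 0 on no match
    $matched = preg_match($pattern, $line) === 1;
    echo ($matched ? 'match:    ' : 'no match: '), $line, PHP_EOL;
}
```

The lookbehind consumes no characters, so it can be dropped in front of any $ without changing what the rest of the pattern captures.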

In your special case, the match is supposed to continue after this, though, so we need a different approach:

/\n{2,}([ \wåäöÅÄÖ,()%+\-:.]{2,75}(?<!\.))\n+/

will match any line that

  • follows two or more newlines (\n{2,}),
  • consists only of 2 to 75 allowed characters ([ \wåäöÅÄÖ,()%+\-:.]),
  • doesn't end in a dot ((?<!\.)),
  • and is followed by at least one newline (\n+).
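To compare the two patterns side by side, here is a small PHP sketch. The sample text is invented for illustration, and the u modifier is an assumption (UTF-8 input) that neither pattern above carries. It also suggests one reason the pattern from the question lets a trailing dot through: the negated class [^.:] is free to match a newline, so a dot-terminated line followed by a blank line still matches.

```php
<?php
// Sample text is made up for illustration; the u modifier assumes UTF-8 input.
$text = "Intro paragraph.\n\nKapitel 1: Översikt\nBody text follows here.\n\n"
      . "This line ends in a dot.\n\nMore body text.\n";

// Pattern from the question: [^.:] can also consume a newline.
$original = '/(\n{2,})([ \wåäöÅÄÖ,()%+\-:.]{2,75}[^.:])(\n{1,})/u';

// Pattern from this answer: the lookbehind rejects a trailing dot.
$fixed = '/\n{2,}([ \wåäöÅÄÖ,()%+\-:.]{2,75}(?<!\.))\n+/u';

preg_match_all($original, $text, $before);
preg_match_all($fixed, $text, $after);

// json_encode() makes captured newlines visible as \n
echo 'question pattern: ', json_encode($before[2], JSON_UNESCAPED_UNICODE), PHP_EOL;
echo 'answer pattern:   ', json_encode($after[1], JSON_UNESCAPED_UNICODE), PHP_EOL;
```

With this input the question's pattern also captures the dot-terminated line (including its newline), while the lookbehind version keeps only the heading.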

EDIT:

A new, expanded regex, trying to incorporate some of the logic discussed in the comments below; formatted as a verbose regex:

preg_match_all(
    '/(?<=\n\n)   # Assert that there are two newlines before the current position
    ^             # Assert that we\'re at the start of a line
    (?![\d -]+$)  # Assert that the line consists not solely of digits, spaces and -s
                  # Assert that the line doesn\'t consist of two Uppercase Words
    (?!\s*\p{Lu}\p{L}*\s+\p{Lu}\p{L}*\s*$)
                  # Match 2-75 of the allowed characters
    [ \wåäöÅÄÖ,()%+\-:.]{2,75}
    (?<!\.)       # Assert that the last one isn\'t a dot
    $             # Assert position at the end of a line
    (?=\n)        # Assert that one newline follows.
    /mxu', 
    $subject, $result, PREG_PATTERN_ORDER);
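As a usage sketch, the verbose regex can be wrapped in a small function and run against a sample. The name find_headings() and the sample text are hypothetical, chosen only to exercise each assertion once:

```php
<?php
// find_headings() is a hypothetical helper name; the pattern is the verbose regex above.
function find_headings(string $subject): array
{
    $pattern = '/(?<=\n\n)   # two newlines before the current position
        ^                    # start of a line
        (?![\d -]+$)         # not solely digits, spaces and hyphens
        (?!\s*\p{Lu}\p{L}*\s+\p{Lu}\p{L}*\s*$)  # not exactly two capitalised words
        [ \wåäöÅÄÖ,()%+\-:.]{2,75}              # 2 to 75 allowed characters
        (?<!\.)              # last character is not a dot
        $                    # end of a line
        (?=\n)               # one newline follows
        /mxu';

    // preg_match_all() returns false on error, for example on invalid UTF-8 input
    if (preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER) === false) {
        return [];
    }
    return $result[0]; // full matches only
}

// Invented sample: one real heading, a date line, a two-word name, a sentence.
$text = "Intro paragraph.\n\nKapitel 1: Översikt\nBody text follows here.\n\n"
      . "2010-01-15\nMore body.\n\nJohn Smith\nSigned above.\n\n"
      . "This line ends in a dot.\nEnd.\n";

print_r(find_headings($text));
```

Because the newlines are only asserted, not consumed, the matches contain just the heading text and need no capture group.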
Tim Pietzcker answered Dec 11 '25 03:12