 

How to exclude occurrences after a positive lookbehind?

Suppose I have the following markdown list items:

- [x] Example of a completed task.
- [x] ! Example of a completed task.
- [x] ? Example of a completed task.

I would like to parse these items with a regex and extract the following capture groups:

  • $1: the left [ and the right ] brackets when the symbol x is in-between
  • $2: the symbol x in between the brackets [ and ]
  • $3: the modifier ! that follows after [x]
  • $4: the modifier ? that follows after [x]
  • $5: the text that follows [x] without a modifier, e.g., [x] This is targeted.
  • $6: the text that follows [x] !
  • $7: the text that follows [x] ?

After a lot of trial-and-error using online parsers, I came up with the following:

((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|((?<=\[x\]\s)\?(?=\s))|((?<=\[x\]\s)[^!?].*)|((?<=\[x\]\s!\s).*)|((?<=\[x\]\s\?\s).*)

To make the regex above more readable, these are the capture groups listed one by one:

  • $1: ((?<=x)\]|\[(?=x]))
  • $2: ((?<=\[)x(?=\]))
  • $3: ((?<=\[x\]\s)!(?=\s))
  • $4: ((?<=\[x\]\s)\?(?=\s))
  • $5: ((?<=\[x\]\s)[^!?].*)
  • $6: ((?<=\[x\]\s!\s).*)
  • $7: ((?<=\[x\]\s\?\s).*)
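To check which group fires for each piece of a list item, the alternation can be run through Ruby (whose engine is Onigmo, an Oniguruma fork). This is only a sketch: `group_hits` is a made-up helper name, every literal `]` is escaped (which does not change what the pattern matches), and each item is scanned on its own line.

```ruby
# The seven alternatives listed above, joined with "|".
# Literal "]" characters are escaped; the pattern is otherwise unchanged.
TASK_RE = /((?<=x)\]|\[(?=x\]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|((?<=\[x\]\s)\?(?=\s))|((?<=\[x\]\s)[^!?].*)|((?<=\[x\]\s!\s).*)|((?<=\[x\]\s\?\s).*)/

# Hypothetical helper: returns [group_number, matched_text] pairs for one line.
def group_hits(line)
  line.scan(TASK_RE).map do |caps|
    idx = caps.index { |c| !c.nil? } # exactly one group participates per match
    [idx + 1, caps[idx]]
  end
end

items = [
  "- [x] Example of a completed task.",
  "- [x] ! Example of a completed task.",
  "- [x] ? Example of a completed task."
]

items.each { |item| p group_hits(item) }
```

Each item yields the brackets as two separate `$1` matches, then `$2`, then the modifier (if any), then the text in `$5`, `$6` or `$7`.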

This is most likely not the best way to do it, but at least it seems to capture what I want:

[Image: matches for the example list items]

I would like to extend that regex to capture lines in a markdown table that looks like this:

|       | Task name                               |    Plan     |   Actual    |      File      |
| :---- | :-------------------------------------- | :---------: | :---------: | :------------: |
| [x]   | Task one with a reasonably long name.   | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |
| [x] ? | Task three with a reasonably long name. | 11:00-13:00 |             | [[task-three]] |

More specifically, I am interested in having the same group captures as above, but I would like to exclude the table grid (i.e., the |). So, groups $1 to $4 should stay the same, but groups $5 to $7 should capture the text, excluding the |, e.g., like in the selection below:

[Image: matches for the example table]

Do you have any ideas on how I can adjust, for example, the regex for group $5 to exclude the |? I have tried all sorts of negations (e.g., [^\|]) without success. I am using Oniguruma regular expressions.

asked Nov 19 '25 19:11 by Mihai


2 Answers

Inspired by Wiktor's answer, here is a shorter regex that captures only the cell text:

(?:\G(?<!\A)\||(?:\[x]\s[?!]?\s*\|?))\K([^|\n]*)

Explanation:

  1. \G(?<!\A)\|

     \G asserts the position at the end of the previous match (or at the start of the string for the first match). The negative lookbehind (?<!\A) then rules out the start of the string, so this branch only applies right after a previous match; the lookahead form (?!\A) has the same effect. \| matches a literal |.

  2. (?:\[x]\s[?!]?\s*\|?)

     A non-capturing group that matches [x], one whitespace character (\s), an optional ? or ! ([?!]?), zero or more whitespace characters (\s*), and an optional | (\|?).

  3. \K

     \K resets the starting point of the reported match, so everything matched before it is left out of the result.

  4. ([^|\n]*)

     Captures zero or more characters other than | and a newline (\n).
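As a rough check, the pattern can be tried on one table row in Ruby (Onigmo). This is a sketch under a few assumptions: `cells_after_checkbox` is a made-up helper name, the start-of-string guard is written as the lookahead `(?!\A)`, and the literal `]` is escaped. Neither spelling change alters what is matched.

```ruby
# The short pattern, with the guard spelled (?!\A) and "]" escaped
# (spelling changes only; an assumption of this sketch).
CELL_RE = /(?:\G(?!\A)\||(?:\[x\]\s[?!]?\s*\|?))\K([^|\n]*)/

# Hypothetical helper: returns the trimmed cell texts that follow the checkbox cell.
def cells_after_checkbox(row)
  cells = row.scan(CELL_RE).flatten.map(&:strip)
  # Without a trailing (?=\|), the pattern can also succeed with an empty
  # match after the row's final "|"; drop such trailing empties if present.
  cells.pop while cells.last&.empty?
  cells
end

row = "| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |"
p cells_after_checkbox(row)
```

Unlike the original seven-group regex, this pattern has a single capture group, so the brackets, the x and the modifier are consumed but not captured.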

answered Nov 22 '25 08:11 by nps


You can use

((?<=x)]|\[(?=x]))|((?<=\[)x(?=]))|((?<=\[x]\s)!(?=\s))|(?<=\[x]\s)(\?)(?=\s)|(?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)

See the regex101 PCRE demo and the Ruby (Onigmo/Oniguruma) demo.

What is added? The (?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|) part:

  • (?: - start of a non-capturing group (a custom boundary here, we'll match...)
    • \G(?!\A)\| - either the end of the previous match and a | char (i.e. | must immediately follow the previous match),
    • |(?<=\[x]\s[?!\s]\s\|) - or a location that is immediately preceded with [x] + a whitespace + a ?, ! or whitespace + a whitespace and | char
  • ) - end of the group
  • \K - match reset operator that removes the text matched so far from the overall match memory buffer
  • ([^|\n]*) - zero or more chars other than | and a line feed char
  • (?=\|) - a | char must appear immediately to the right of the current location.
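A minimal Ruby (Onigmo) sketch of how the full pattern behaves on one table row. `table_hits` is a hypothetical helper name, and every literal `]` is escaped, which is equivalent to the pattern as posted. In this pattern the cell text always lands in group 5, whatever the modifier, so groups $5 to $7 of the question collapse into one.

```ruby
# The full pattern from this answer, with literal "]" escaped (no change in meaning).
TABLE_RE = /((?<=x)\]|\[(?=x\]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|(?<=\[x\]\s)(\?)(?=\s)|(?:\G(?!\A)\||(?<=\[x\]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)/

# Hypothetical helper: returns [group_number, trimmed_text] pairs for one row.
def table_hits(row)
  row.scan(TABLE_RE).map do |caps|
    idx = caps.index { |c| !c.nil? } # exactly one group participates per match
    [idx + 1, caps[idx].strip]
  end
end

row = "| [x] ? | Task three with a reasonably long name. | 11:00-13:00 |             | [[task-three]] |"
table_hits(row).each { |group, text| puts "$#{group}: #{text.inspect}" }
```

The first cell is reached through the lookbehind branch, and every later cell through `\G(?!\A)\|`, which chains each match to the end of the previous one. The final `|` produces no match because `(?=\|)` fails at the end of the row.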

answered Nov 22 '25 09:11 by Wiktor Stribiżew


