Regex with hyperlinks ActiveDocument.Range and Format

Question

Tobias answer seems to be the ticket. Just wanted to add that I just realized the quantifier was meaningless in the character class. Also noticed a colleague's emails often have a space in front of the number and after the dollar sign, so some better regex is below (for US dollar amts):

RegExp.Pattern = "\$\s*([\,\d]*(?:\.\d{2})?)"

With some inspiration from this: What does a hyperlink range.start and range.end refer to? Came up with this:

Sub trueUpAttempt()
Dim OrigLength As Long
Debug.Print ActiveDocument.Characters.Count

Dim SelStart As Long
Dim SelEnd As Long
Dim SelLength As Long

Dim rHyperlink As Range
Dim wdHyperlink As Hyperlink
    For Each wdHyperlink In ActiveDocument.Hyperlinks
        Set rHyperlink = wdHyperlink.Range
        'Debug.Print rHyperlink.Start
        'Debug.Print rHyperlink.End
        'Debug.Print rHyperlink.End - rHyperlink.Start
        Debug.Print rHyperlink.End - rHyperlink.Start - Len(rHyperlink)
        'there's got to be some way to true up the character offset, even if its ugly
        Debug.Print ActiveDocument.Characters.Count + rHyperlink.End - rHyperlink.Start - Len(rHyperlink)
    Next
End Sub

That's not a fix, but I think is an outline to reconcile the character offsets. This is all because word is counting all 62 characters in for example {HYPERLINK "http://www.smithany.com"} http://www.smithany.com

Edit 7-22-2023 attempting Tobais suggestion in reverse:

Sub DollarHighlighter2()
Set regExp = New regExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim offsetEnd As Long
offsetEnd = Selection.End
regExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
regExp.Global = True
Set allMatches = regExp.Execute(Selection.text)   ' Execute search.
For i = allMatches.Count - 1 To 0 Step -1
    'MsgBox allMatches.Item(i)
    ActiveDocument.Range(offsetEnd - allMatches.Item(i).FirstIndex, End:=offsetEnd - allMatches.Item(i).FirstIndex + allMatches.Item(i).Length).FormattedText.HighlightColorIndex = wdYellow
Next
End Sub

But this still seems to have a similar issue with links, and perhaps other content. I also tried the same Range determination forwards, but looping over matches in reverse and had similar problems.

Working link to example file here (no ssl): http://www.smithany.com/exampleDollarHighliter.docx

Original: I have seen several other StackOverflow posts such as this one: How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word on using regular expressions in Microsoft Word with VBA using the Microsoft VB script Regular Expressions 5.5 Reference.

That helped me prepare the following, which I use in Word to highlight US Dollar amounts:

Sub dollarHighlighter()
Set regExp = New regExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim offsetStart As Long
offsetStart = Selection.Start
regExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
regExp.Global = True
Set colMatches = regExp.Execute(Selection.Text)   ' Execute search.
For Each objMatch In colMatches   ' Iterate Matches collection.
  Set myRange = ActiveDocument.Range(objMatch.FirstIndex + offsetStart, 
    End:=offsetStart + objMatch.FirstIndex + objMatch.Length)
  myRange.FormattedText.HighlightColorIndex = wdYellow
Next
   End Sub

While this works as expected on a list of dollar amounts within text (for the most part - among its imperfections the regex is intentionally a bit loose) it does not work as anticipated when there are hyperlinks present in the Word document.

In that instance, there appears to be a shift in offset of the highlighted characters in a somewhat unpredictable manner. I assume this is because there is a lot of new xml/css in the document.xml source file.

Ultimately, my overarching questions is, can I use regex to highlight word document contents even if it contains hyperlinks? Is it an offset question or should I run the regex on the compressed xml, re compress and reopen for better results? As when I test various regex variations on the source code, I get the anticipated results, but not when formatting what would be the Word range.

I have also asked this here: https://social.msdn.microsoft.com/Forums/en-US/3a95c5e4-9e0c-4da9-970f-e0bf801c3170/macro-for-a-regexp-search-replace?forum=isvvba&prof=required but realize it was an ancient post...

Per question below, here are some possibly helpful links: an example document http://www.smithany.com/test.docx step 1 http://www.smithany.com/wordusd1.jpg Step 2 http://www.smithany.com/wordhighlighterrun.jpg and what happens http://www.smithany.com/whatactuallyhappens.jpg

Temporary Workaround: As suggested below Word's Wildcard find is fast if you do not stack the loops. try this:

Sub Macro2()
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
    .Text = "$[0-9,]{1,}"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchCase = False
    .MatchWholeWord = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
    .Text = "$[0-9,]{1,}.[0-9]{2,3}"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchCase = False
    .MatchWholeWord = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll

End Sub

which basically gets all the dollar amounts highlighted. That said, complex expressions like matching a variety of date formats could get messy, but i suppose it is all possible doing them one step at a time.

wp78de · Accepted Answer

I had not touched VBA for years but I guess it's like bicycling.

Anyways, here is a sub that should help you out. It's based on Cindy Meister sound recommendation and fills the gap between Regex and Wildcard Find using a collection of match patterns for optional parts.

First, the Wildcard matches: $[0-9,]{1,} and $[0-9,]{1,}.[0-9]{2}

It's not that different after all, isn't it? However, to take the optional fraction part into account I have to use two patterns.

And here's the routine:

Sub WildcardsHighlightWords()
    Dim Word As Range
    Dim WildcardCollection(2) As String
    Dim Words As Variant
    WildcardCollection(0) = "$[0-9,]{1,}"
    WildcardCollection(1) = "$[0-9,]{1,}.[0-9]{2}"
    Options.DefaultHighlightColorIndex = wdYellow
    'Clear existing formatting and settings in Find
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    'Set highlight to replace setting.
    Selection.Find.Replacement.Highlight = True
    'Cycle through document and find wildcards patterns, highlight words when found
    For Each Word In ActiveDocument.Words
        For Each WildcardsPattern In WildcardCollection
            With Selection.Find
                .Text = WildcardsPattern
                .Replacement.Text = ""
                .Forward = True
                .Wrap = wdFindContinue
                .Format = True
                .MatchCase = False
                .MatchWholeWord = False
                .MatchWildcards = True
                .MatchSoundsLike = False
                .MatchAllWordForms = False
            End With
            Selection.Find.Execute Replace:=wdReplaceAll
        Next
    Next
End Sub

It should be easy to extend or modify this approach if needed.

This highlithts the Dollar amounts as desired on my end:

enter image description here

Note: The separator in the quantifier {n,m} is not the same in all localizations, e.g. it's {n;m} in the German version.

tobias · Answer

Update 26.07.2023: You can easily circument all those problems if you go through your document paragraph by paragraph. However, this works in your case because the regex matches stay within paragraph boundaries!

Given this limitation, the following vba code will work:

Sub DollarHighlighter4()
    
    '26.07.2023, works within tables
    Dim RegExp As RegExp
    Dim allMatches As MatchCollection
    Dim wdPar As Paragraph
    Dim rngPar, rngDoc, rngFormat As Range
    Dim i, intA, intB As Integer
    
    Set rngDoc = ActiveDocument.Range
    
    Set RegExp = New RegExp
    RegExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
    RegExp.Global = True

    For Each wdPar In rngDoc.Paragraphs
        
        Set rngPar = wdPar.Range
        ' Get all matches, within current paragraph
        Set allMatches = RegExp.Execute(rngPar)
        
        ' Highlight all matches, within current paragraph
        For i = allMatches.Count - 1 To 0 Step -1
            intA = allMatches.Item(i).FirstIndex
            intB = intA + allMatches.Item(i).Length
            Set rngPar = wdPar.Range ' Always reset range to whole content
            Set rngFormat = wdPar.Range 'current Paragraph.Range
            ' Adjust text-range to actual regex-match
            ' Character-address refers to current paragraph
            rngFormat.SetRange Start:=rngPar.Characters(intA + 1).Start, _
                End:=rngPar.Characters(intB).End
            ' Perform action to range
            rngFormat.FormattedText.HighlightColorIndex = wdYellow
        Next

    Next wdPar
    
    'Finish
    Set rngFormat = Nothing
    Set rngPar = Nothing
    Set rngDoc = Nothing
    Set RegExp = Nothing
    Set allMatches = Nothing
    
End Sub

@Allan: You should use YourVariable.SetRange so that you can define a range based upon character positions.

This should work:

Sub DollarHighlighter3()
Set regExp = New regExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim offsetEnd As Long
Dim rngFormat As Range
Dim intA, intB As Integer
regExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
regExp.Global = True
Set allMatches = regExp.Execute(ActiveDocument.Content)   ' Execute search.
For i = allMatches.Count - 1 To 0 Step -1
    intA = allMatches.Item(i).FirstIndex
    intB = intA + allMatches.Item(i).Length
    Set rngFormat = ActiveDocument.Range
    rngFormat.SetRange Start:=ActiveDocument.Range.Characters(intA).End, _
        End:=ActiveDocument.Range.Characters(intB).End
    rngFormat.FormattedText.HighlightColorIndex = wdYellow
Next
End Sub

Yesterday (20.07.2023), I faced the same question: Recognize text occurrences based upon regex patterns - and converting them into hyperlinks.

What worked for me is: Backward solving!

The regex object, once it is "SET", has static index values, based upon the original word text. By inserting hyperlinks the word text gets longer. So either you redefine the regex object after each text action (problem: if the hyperlink inserted will get a match by itself...). Or you parse your document from end to start. This can be done by a countdown loop, starting with the last regex occurrence.

Regex with hyperlinks ActiveDocument.Range and Format

Tags:

regex

ms-word

vba

Allan

2 Answers

wp78de

tobias

Recent Activity

Donate For Us

Regex with hyperlinks ActiveDocument.Range and Format

Tags:

regex

ms-word

vba

Allan

2 Answers

wp78de

tobias

Related questions

Recent Activity

Donate For Us