
Get a 2 String array from HTML .... using regex?

Tags: arrays, regex

I'm working on a personal project to automatically fill out the USPS Click & Ship form and then output the Ref# and the Delivery Confirmation #.

So far I've been able to get the whole process done, but I can't for the life of me figure out how to pull out the Ref# (which is my order #) and the Delivery Confirmation #.

Basically, for every package you print a label for, the confirmation HTML page comes back with the following in it.

 <tr class="smTableText">
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px;" valign="top">
    <table cellpadding="0" cellspacing="0" border="0" style="margin:7px 0px 0px 5px;">
      <tr> 
       <td valign="top" class="mainText" width=46>1 of 1</td>  
       <td valign="top" width=21><a href="javascript:toggleMoreInfo(0)" tabindex="19"><img src="/cns/images/common/button_plus.gif" height="11" width="11" border="0" hspace="0" vspace="0" id="Img1" style="margin-right:10px;" alt=""></a></td>  
       <td valign="top" width=203><div class="mainText" style="margin-bottom:10px; height:1em; overflow:hidden;" id="Div1">FIRSTLAST NAME<BR>STREET ADDRESS<BR>CITY, STATE  ZIP5-ZIP4<div class="smTableText">[email protected]<BR>Ref#: 100000000<BR></div> </div><div class="smTableText"></div> </td> 
      </tr>
    </table>
  </td> 
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-top:7px;" valign="top" class="smTableText"><div id="Div2" style="margin-left:7px; height:2.4em; overflow:hidden;">&nbsp;Ship Date: 11/17/09<br>&nbsp;Weight: 0lbs 9oz<br>&nbsp;From: 48506<br></div></td>
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-right:15px; padding-top:7px;" valign="top" align="right" class="smTableText"><div class="smTableText" id="Div3" style="height:2.4em; overflow:hidden; margin-bottom:3px;">Priority Mail                      <br>Delivery Confirm.<br></div> <span style="font-weight:bold;" class="smTableText">Label Total</span></td>
  <td style="border-top:solid 1px #AAAAAA; padding-bottom:4px; padding-right:15px; padding-top:7px;" valign="top" align="right" class="smTableText"><div class="smTableText" id="Div4" style="height:2.4em; overflow:hidden; margin-bottom:3px;">$4.80<br>$0.00<br></div><span class="smTableTextbold">$4.80</span></td>
</tr>
<tr class="smTableText"> <td colspan=4 style="height:20px;" valign="top"><div class="mainText" style="margin:0px; padding:4px 8px 0px 8px; display:block; border-top:solid 1px #AAAAAA;">Delivery Confirmation&#153; Label Number: <span class="mainTextbold">0000 1111 2222 3333 4444 55</span></div></td> </tr>

What I need to do is loop through the entire page, find "Ref#: " and capture the next 9 characters, then find the next "Label Number: <span class="mainTextbold">" and capture the next 27 characters. Each pair of Ref# and label number should be saved to an array.

I'm guessing that regex will probably be my best option for this? Can anyone provide an example of how this would work? VB.NET preferred, but C# is OK too.

UPDATE: As pointed out in the comments, this is not XML but rather the HTML source from the WebBrowser control in which the page is displayed.

I am auto-filling each page, then invoking the click action on the submit button to get to the next page. The problem is that on this last page, the data I need isn't neatly wrapped in a tag unique to that field for me to pull from.

UPDATE #2: Alright, using the example given, I have come up with the following. It seems like a lot of work to pull out the 2 values. I am guessing there must be a more efficient way of doing it.

' Requires "Imports System.Text.RegularExpressions" at the top of the file.
'Sub getdeliverynum(ByVal sText As String)
Sub getdeliverynum()
    Me.MainTabControl.SelectedTab = USPSsiteTAB
    WebBrowser1.Navigate("http://www.vaporstix.com/usps.html")
    ' Wait until the page has finished loading.
    While Not WebBrowser1.ReadyState = WebBrowserReadyState.Complete
        Application.DoEvents()
    End While
    Dim input As String = WebBrowser1.DocumentText
    ' Group 1 captures the Ref#, group 2 captures the label number.
    Dim pattern As String = "Ref#: ([^<]+)[\S\s]*?Label Number: <span class=""mainTextbold"">([^<]+)"

    For Each match As Match In Regex.Matches(input, pattern)
        ' Counts the groups as they are visited; an Integer is enough for a counter.
        Dim instance As Integer = 0
        Dim ref As String = ""
        Dim track As String = ""
        For Each group As Group In match.Groups
            instance = instance + 1
            If instance = 1 Then
                'do nothing this is the full string.... 
            ElseIf instance = 2 Then
                ref = group.Value
            ElseIf instance = 3 Then
                track = group.Value
            End If
        Next
        'replace with insert to db... this is for testing.
        MsgBox("Ref: " + ref + vbCrLf + "Confirmation: " + track)
    Next

End Sub
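
On the "more efficient way" point: the capture groups can be read by index instead of counting through match.Groups. Below is a minimal sketch of that idea, not a drop-in replacement. It reuses the pattern from the code above; the module name LabelExtractor, the function ExtractPairs and the abbreviated sample string are made up for illustration, and the WebBrowser part is left out.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LabelExtractor
    ' Same pattern as above: group 1 is the Ref#, group 2 is the label number.
    Private Const Pattern As String = "Ref#: ([^<]+)[\S\s]*?Label Number: <span class=""mainTextbold"">([^<]+)"

    ' Hypothetical helper: returns one two-element String array {ref, label} per match.
    Function ExtractPairs(ByVal html As String) As List(Of String())
        Dim pairs As New List(Of String())
        For Each m As Match In Regex.Matches(html, Pattern)
            ' Groups(0) is the whole match; Groups(1) and Groups(2) are the two captures.
            pairs.Add(New String() {m.Groups(1).Value.Trim(), m.Groups(2).Value.Trim()})
        Next
        Return pairs
    End Function

    Sub Main()
        ' Abbreviated stand-in for WebBrowser1.DocumentText (assumption, not the full page).
        Dim sample As String = "<div>Ref#: 100000000<BR></div> Label Number: <span class=""mainTextbold"">0000 1111 2222 3333 4444 55</span>"
        For Each pair As String() In ExtractPairs(sample)
            Console.WriteLine("Ref: " & pair(0) & vbCrLf & "Confirmation: " & pair(1))
        Next
    End Sub
End Module
```

Each entry in the returned list is a 2-string array, with pair(0) holding the Ref# and pair(1) the label number, so the MsgBox call could be swapped for the database insert without touching the matching code.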

1 Answer

You should use System.Xml and a proper parser to do that work. XPath, or even navigating the XmlDocument directly, would let you achieve what you are looking for.

Imports System.Xml.XPath

' Load the document and select every span whose class is mainTextbold.
Dim xpathDoc As New XPathDocument("c:\builder.xml")
Dim xmlNav As XPathNavigator = xpathDoc.CreateNavigator()
Dim xmlNI As XPathNodeIterator = xmlNav.Select("//span[@class='mainTextbold']")
While xmlNI.MoveNext()
    ' Current.Name is the element name, Current.Value is its text content.
    System.Console.WriteLine(xmlNI.Current.Name & " : " & xmlNI.Current.Value)
End While

I suggest you take a look there or there for more information on how to extract information from an XmlDocument.

An XPath selector like //span[@class='mainTextbold'] would return all of those spans.

As per Heinzi's remark, your document doesn't seem to be valid XHTML. You should convert it to XHTML using TidyNet and then parse the result of the conversion.
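
Once the page is well-formed, the same query can also run against an in-memory string rather than a file. Here is a rough sketch, assuming the conversion has already been done: the fragment below is a hand-written stand-in for the converted page (not real TidyNet output), and the module name XPathSketch is made up.

```vbnet
Imports System
Imports System.Xml

Module XPathSketch
    Sub Main()
        ' Hypothetical, already well-formed fragment standing in for the converted page.
        Dim xhtml As String = "<table><tr><td><div class=""mainText"">Label Number: <span class=""mainTextbold"">0000 1111 2222 3333 4444 55</span></div></td></tr></table>"

        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml)

        ' The leading // searches the whole document, not only children of the current node.
        For Each node As XmlNode In doc.SelectNodes("//span[@class='mainTextbold']")
            Console.WriteLine(node.InnerText)
        Next
    End Sub
End Module
```

If the converted document declares the XHTML namespace on its root element, an unprefixed name such as span will not match; in that case the query would need an XmlNamespaceManager and a prefixed element name.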

answered Mar 18 '26 14:03 by RageZ


