RegEx PowerShell match

The website http://www.shazam.com/charts/top-100/australia displays a chart of songs. I want to capture the songs using RegEx and PowerShell. The PowerShell code below is what I have so far:

    $ie = New-Object -comObject InternetExplorer.Application
    $ie.navigate('http://www.shazam.com/charts/top-100/australia')
    Start-Sleep -Seconds 10
    $null = $ie.Document.body.innerhtml -match 'data-chart-position="1"(.|\n)*data-track-title=.*content="(.*)"><a href(.|\n)*data-track-artist=\W\W>(.|\n)*<meta\scontent="(.*)"\sitemprop'
    $shazam01artist = $matches[5]
    $shazam01title = $matches[2]

data-chart-position

data-track-title

data-track-artist

Each song listed has the three attributes above associated with it. I want to capture the Artist and Title for each song by chart position (number), so I need a regular expression that finds the chart position and then the Artist and Title that follow it.

If I run the RegEx separately for Artist and Title (code below), it finds them, but only for the first song. I need the Artist and Title for every song, based on its chart position.

    $null = $ie.Document.body.innerhtml -match 'data-track-artist=\W\W>(.|\n)*<meta\scontent="(.*)"\sitemprop'
    $shazam01artist = $matches[2]
    $null = $ie.Document.body.innerhtml -match 'data-track-title=.*content="(.*)"><a href'
    $shazam01title = $matches[1]
    $shazam01artist
    $shazam01title
asked Aug 30 '25 16:08 by Marc Kean



1 Answer

Using regex to parse partial HTML is an absolute nightmare; you might want to reconsider that approach.

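The response object returned by Invoke-WebRequest has a property called ParsedHtml, which holds a reference to a pre-parsed HTMLDocument object (this property is available in Windows PowerShell, where it relies on the Internet Explorer engine). Use that instead: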

    # Fetch the document
    $Top100Response = Invoke-WebRequest -Uri 'http://www.shazam.com/charts/top-100/australia'

    # Select all the "article" elements that contain charted tracks
    $Top100Entries = $Top100Response.ParsedHtml.getElementsByTagName("article") | Where-Object { $_.className -eq 'ti__container' }

    # Iterate over each article
    $Top100 = foreach ($Entry in $Top100Entries) {
        $Properties = @{
            # Collect the chart position from the article element
            Position = $Entry.getAttribute('data-chart-position', 0)
        }

        # Iterate over the inner paragraphs containing the remaining details
        $Entry.getElementsByTagName('p') | ForEach-Object {
            if ($_.className -eq 'ti__artist') {
                # the ti__artist paragraph contains a META element that holds the artist name
                $Properties['Artist'] = $_.getElementsByTagName('META').item(0).getAttribute('content', 0)
            } elseif ($_.className -eq 'ti__title') {
                # the ti__title paragraph stores the title name directly in the content attribute
                $Properties['Title'] = $_.getAttribute('content', 0)
            }
        }

        # Create a psobject based on the details we just collected
        New-Object -TypeName psobject -Property $Properties
    }
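
A side note on the last step: a plain hashtable does not guarantee key order, so the column order of the resulting objects can vary. Here is a minimal sketch, assuming PowerShell 3.0 or later, of the same New-Object call fed from an `[ordered]` dictionary. The variable names `$OrderedProperties` and `$SampleEntry` are made up for illustration, and the values are copied from the sample output in this answer rather than fetched from the page.

```powershell
# Hypothetical stand-alone example: the values are hard-coded, not scraped.
$OrderedProperties = [ordered]@{
    Position = '42'
    Artist   = 'Taylor Swift Feat. Kendrick Lamar'
    Title    = 'Bad Blood'
}

# Same New-Object call as above; the ordered dictionary fixes the property order.
$SampleEntry = New-Object -TypeName psobject -Property $OrderedProperties

# List the property names in the order they were defined.
$SampleEntry.psobject.Properties.Name
```

In the loop above, the equivalent change would be to initialize `$Properties` with `[ordered]@{ ... }` instead of `@{ ... }`.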

Now, let's see how Tay-Tay's doing down under:

    PS C:\> $Top100 | Where-Object { $_.Artist -match "Taylor Swift" }

    Position           Title             Artist
    --------           -----             ------
    42                 Bad Blood         Taylor Swift Feat. Kendrick Lamar

Sweet!
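
As for why the attempt in the question only ever returns the first song: the `-match` operator stops at the first match and fills `$matches` once. If you still want to go the regex route, `[regex]::Matches` returns every match. The sketch below runs against a made-up, heavily simplified HTML fragment; the real page markup may well differ, and the pattern and the `$html`, `$pattern` and `$songs` names are assumptions for illustration only. Position 1 is a placeholder entry, and position 42 reuses the row from the output above.

```powershell
# Hypothetical, simplified fragment; NOT the actual Shazam markup.
$html = @'
<article data-chart-position="1"><p class="ti__title" content="Example Title"></p><p class="ti__artist"><meta content="Example Artist" itemprop="name"></p></article>
<article data-chart-position="42"><p class="ti__title" content="Bad Blood"></p><p class="ti__artist"><meta content="Taylor Swift Feat. Kendrick Lamar" itemprop="name"></p></article>
'@

# (?s) lets "." span line breaks; the lazy ".*?" keeps each match
# from running past its own entry into the next one.
$pattern = '(?s)data-chart-position="(?<pos>\d+)".*?ti__title" content="(?<title>[^"]*)".*?<meta content="(?<artist>[^"]*)"'

# [regex]::Matches returns all matches, unlike -match, which stops at the first.
$songs = foreach ($m in [regex]::Matches($html, $pattern)) {
    New-Object -TypeName psobject -Property @{
        Position = [int]$m.Groups['pos'].Value
        Title    = $m.Groups['title'].Value
        Artist   = $m.Groups['artist'].Value
    }
}

$songs | Where-Object { $_.Position -eq 42 }
```

The greedy `(.|\n)*` groups in the original pattern are the other half of the problem: they swallow as much of the document as they can, so the captured groups no longer line up with a single chart entry. Even with lazy quantifiers, a pattern like this breaks as soon as the attribute order or whitespace in the markup changes, which is why the parsed-document approach is the sturdier one.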

answered Sep 02 '25 11:09 by Mathias R. Jessen
