I want to find all elements of array $a1 that are in neither array $a2 nor array $a3.
For example:
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)
Expected result:
8
Try this:
$a2AndA3 = $a2 + $a3
$notInA2AndA3 = $a1 | Where-Object {!$a2AndA3.contains($_)}
As a one liner:
$notInA2AndA3 = $a1 | Where {!($a2 + $a3).contains($_)}
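As a quick check against the sample data from the question, both forms can be run side by side. This is only a sketch; the variable name $oneLiner is introduced here for illustration and is not part of the answer above.

```powershell
# Sample data from the question.
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7)

# Two-step form: join the exclusion arrays once, then filter $a1.
$a2AndA3 = $a2 + $a3
$notInA2AndA3 = $a1 | Where-Object { !$a2AndA3.Contains($_) }

# One-liner form ($oneLiner is an illustrative name):
# the arrays are rejoined for every element of $a1.
$oneLiner = $a1 | Where-Object { !($a2 + $a3).Contains($_) }

$notInA2AndA3   # 8
$oneLiner       # 8
```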
k7s5a's helpful answer is conceptually elegant and convenient, but there's a caveat:
It doesn't scale well, because a linear search of the combined array must be performed for each $a1 element.
At least for larger arrays, PowerShell's Compare-Object cmdlet is the better choice:
If the input arrays are ALREADY SORTED:
(Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject
Note:
* Compare-Object doesn't require sorted input, but sorting can greatly enhance performance - see below.
* As Esperento57 points out, (Compare-Object $a1 ($a2 + $a3)).InputObject is sufficient in the specific case at hand, but only because $a2 and $a3 happen not to contain elements that aren't also in $a1.
Therefore, the more general solution is to use the filter Where-Object SideIndicator -eq '<=', because it limits the results to objects that are present only in the LHS ($a1), that is, missing from the RHS ($a2 + $a3), and not also vice versa.
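To illustrate why the filter matters, here is a minimal sketch that assumes $a3 additionally contains the value 9, which is not in $a1. That extra value and the variable names $unfiltered and $filtered are hypothetical and not part of the original question.

```powershell
$a1 = @(1,2,3,4,5,6,7,8)
$a2 = @(1,2,3)
$a3 = @(4,5,6,7,9)   # hypothetical: 9 does not occur in $a1

# Without the filter, differences from BOTH sides are returned.
$unfiltered = @((Compare-Object $a1 ($a2 + $a3)).InputObject)

# With the filter, only elements unique to $a1 (the reference side) remain.
$filtered = @((Compare-Object $a1 ($a2 + $a3) |
    Where-Object SideIndicator -eq '<=').InputObject)

"Unfiltered: $($unfiltered | Sort-Object)"   # contains both 8 and 9
"Filtered:   $filtered"                      # contains only 8
```

The @(...) wrappers only ensure that each result is an array even when a single element is returned.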
If the input arrays are NOT SORTED:
Explicitly sorting the input arrays before comparing them greatly enhances performance:
(Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) |
Where-Object SideIndicator -eq '<=').InputObject
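As a small sanity check, sorting affects only performance, not the result. The sketch below uses the question's values in an arbitrary, made-up order; the variable names $viaUnsorted and $viaSorted are illustrative.

```powershell
# Same values as in the question, but in arbitrary (hypothetical) order.
$a1 = @(5,8,1,7,3,2,6,4)
$a2 = @(3,1,2)
$a3 = @(7,4,6,5)

# Comparing the unsorted arrays directly.
$viaUnsorted = (Compare-Object $a1 ($a2 + $a3) |
    Where-Object SideIndicator -eq '<=').InputObject

# Sorting both sides first, as recommended above for larger arrays.
$viaSorted = (Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) |
    Where-Object SideIndicator -eq '<=').InputObject

$viaUnsorted   # 8
$viaSorted     # 8
```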
The following example, which uses a 10,000-element array, illustrates the difference in performance:
$count = 10000                    # Adjust this number to test scaling.

$a1 = 0..$($count-1)              # With 10,000: 0..9999
$a2 = 0..$($count/2)              # With 10,000: 0..5000
$a3 = $($count/2+1)..($count-3)   # With 10,000: 5001..9997

$(foreach ($pass in 1..2) {

  if ($pass -eq 1 ) {
    $passDescr = "SORTED input"
  } else {
    $passDescr = "UNSORTED input"
    # Shuffle the arrays.
    $a1 = $a1 | Get-Random -Count ([int]::MaxValue)
    $a2 = $a2 | Get-Random -Count ([int]::MaxValue)
    $a3 = $a3 | Get-Random -Count ([int]::MaxValue)
  }

  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject, explicitly sorted first"
    Timing = (Measure-Command {
      (Compare-Object ($a1 | Sort-Object) ($a2 + $a3 | Sort-Object) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "CompareObject"
    Timing = (Measure-Command {
      (Compare-Object $a1 ($a2 + $a3) | Where-Object SideIndicator -eq '<=').InputObject |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { !$a2AndA3.Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "!.Contains(), single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { !($a2 + $a3).Contains($_) } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3
      $a1 | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, two-pass, explicitly sorted first"
    Timing = (Measure-Command {
      $a2AndA3 = $a2 + $a3 | Sort-Object
      $a1 | Sort-Object | Where-Object { $a2AndA3 -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  },
  [pscustomobject] @{
    TestCategory = $passDescr
    Test = "-notcontains, single-pass"
    Timing = (Measure-Command {
      $a1 | Where-Object { ($a2 + $a3) -notcontains $_ } |
        Out-Host; '---' | Out-Host
    }).TotalSeconds
  }
}) |
  Group-Object TestCategory | ForEach-Object {
    "`n=========== $($_.Name)`n"
    $_.Group | Sort-Object Timing | Select-Object Test, @{ l='Timing'; e={ '{0:N3}' -f $_.Timing } }
  }
Sample output from my machine (output of missing array elements omitted):
=========== SORTED input

Test                                            Timing
----                                            ------
CompareObject                                   0.068
CompareObject, explicitly sorted first          0.187
!.Contains(), two-pass                          0.548
-notcontains, two-pass                          6.186
-notcontains, two-pass, explicitly sorted first 6.972
!.Contains(), two-pass, explicitly sorted first 12.137
!.Contains(), single-pass                       13.354
-notcontains, single-pass                       18.379

=========== UNSORTED input

CompareObject, explicitly sorted first          0.198
CompareObject                                   6.617
-notcontains, two-pass                          6.927
-notcontains, two-pass, explicitly sorted first 7.142
!.Contains(), two-pass                          12.263
!.Contains(), two-pass, explicitly sorted first 12.641
-notcontains, single-pass                       19.273
!.Contains(), single-pass                       25.174
While timings will vary based on many factors, the numbers give a sense that Compare-Object scales much better, provided the input is either pre-sorted or sorted on demand, and that the performance gap widens with increasing element count.
When not using Compare-Object, performance can be somewhat increased - but not being able to take advantage of sorting is the fundamentally limiting factor:
* Neither -notcontains / -contains nor .Contains() can take full advantage of presorted input.
* If the input is already sorted: Using the .Contains() IList interface .NET method rather than the PowerShell -contains / -notcontains operators (which an earlier version of k7s5a's answer used) improves performance.
* Joining arrays $a2 and $a3 once, up front, and then using the joined array in the pipeline improves performance (that way, the arrays don't have to be joined in every iteration).
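One thing worth keeping in mind when choosing between the two: .Contains() and -contains are not behaviorally identical, so the faster one may not always be a drop-in replacement. The sketch below uses made-up sample arrays ($numbers, $letters) to show two differences: type conversion and case sensitivity.

```powershell
# Hypothetical sample data, not from the question.
$numbers = @(1, 2, 3)
$letters = @('a', 'b')

# Type conversion: the operator converts the string '2' to a number before comparing;
# the .NET method compares via Equals(), so a string never equals an integer.
$numbers -contains '2'    # True
$numbers.Contains('2')    # False

# Case sensitivity: the operator is case-insensitive by default; the method is not.
$letters -contains 'A'    # True
$letters.Contains('A')    # False
```

With the integer arrays from the question, both approaches return the same result.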