I am trying to replicate the following pattern (shown in Python) in SAS:
>>> df = pd.DataFrame({'a': list('abc'), 'b': list('cba')})
>>> df
a b
0 a c
1 b b
2 c a
>>> df['key'] = [frozenset([x, y]) for x, y in df[['a', 'b']].values]
>>> df
a b key
0 a c (a, c)
1 b b (b)
2 c a (a, c)
>>> df.drop_duplicates('key')
a b key
0 a c (a, c)
1 b b (b)
Effectively, I have two columns, a and b. This table was produced via a self-join, and a and b represent the same canonical column. I need to drop duplicates on the combined/concatenated a and b columns without preserving the ordering of the values within those columns. I can do this very easily in Python, using the frozenset unordered set collection - see above.
I thought concatenating the columns then sorting might work, but in reality my columns are much more complex than single-letter cells, and the comment from this article
Hmm. "Tim Mott" and "Tom Mitt" will not match in name form, but will match in sorted anagram form. That could be interesting.
Made it clear that's not what I want to be doing.
Any tips on how to replicate this in SAS?
Edit: Ultimately, this is the view I'm trying to create, basically counting records by that unordered key:
>>> df.groupby('key').size()
key
(a, c) 2
(b) 1
dtype: int64
If you are just looking at two variables it is pretty simple. Just combine them with the smaller value first.
data have ;
input (var1-var2) ($);
cards;
a c
b b
c a
;
proc sql ;
create table want as
select
case when var1=var2 then var1
when var1<var2 then catx(',',var1,var2)
else catx(',',var2,var1)
end as key
, count(*) as size
from have
group by 1
;
quit;
Result:
Obs key size
1 a,c 2
2 b 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With