I have a list of unique strings (originally the column names of a table). The task is to abbreviate each string as far as possible while the list as a whole remains distinct.
For example, AAA, AB can be abbreviated to AA, AB (but not to A, AB, because A could be a prefix of both AAA and AB).
AAAA, BAAAA can be shortened to A, B.
But A1, A2 can’t be abbreviated at all.
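To pin the rule down, here is a minimal Python sketch of how I read it (this is not the SQL solution I am asking for, and the function name shortest_unique_prefixes is made up for illustration). It assumes an abbreviation is always a prefix, and that a prefix is acceptable as soon as no other string in the list starts with it.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that no other name starts with.

    Assumption: the input list is already unique, as stated in the question.
    """
    result = {}
    for name in names:
        others = [n for n in names if n != name]
        result[name] = name  # fallback: the name cannot be shortened
        for k in range(1, len(name) + 1):
            prefix = name[:k]
            # Accept the first (shortest) prefix that no other name shares.
            if not any(o.startswith(prefix) for o in others):
                result[name] = prefix
                break
    return result


print(shortest_unique_prefixes(["AAA", "AB"]))
print(shortest_unique_prefixes(["AAAA", "BAAAA"]))
print(shortest_unique_prefixes(["A1", "A2"]))
```

The three calls correspond to the three examples above.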
Here is the sample data:
create table tab as 
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;
The expected result is
COL    ABR_COL                
------ ------------------------
AAA    AAA                      
AAAAAA AAAA                     
AABA   AAB                      
AB     AB                       
BAAAA  BA                       
BBAA   BB                       
COL1   COL1                     
COL21  COL2        
I managed a brute-force solution consisting of four subqueries, which I am deliberately not posting, because I hope there is a simpler solution and I do not want to distract from it.
Btw, there is a similar function in R called abbreviate, but I’m looking for an SQL solution. Oracle is preferred, but solutions for other RDBMS are welcome.
I would do the filtering in the recursive CTE:
with potential_abbreviations(col, abbr, lev) as (
      select col, col as abbr, 1 as lev
      from tab
      union all
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1
Here is a db<>fiddle.
Strictly speaking, lev is not necessary.  Because lev grows as abbr gets shorter, you can order by length(abbr) ascending instead of lev desc.  But, I usually include a recursion counter when I use recursive CTEs, so this is habit.
Doing the extra comparison in the CTE may look more complicated, but it simplifies the execution -- the recursion stops at the correct value.
This is also tested on unique single letter col values.
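For anyone who wants to cross-check the output outside the database, here is a rough Python sketch of the same trimming idea. The names trim_abbreviation and SAMPLE are made up for illustration; the loop mirrors the recursive step (drop one trailing character while no other column starts with the shorter prefix) and keeps a counter like lev. One difference to be aware of: startswith compares literally, whereas LIKE treats _ and % in the pattern as wildcards, so the two can disagree for column names that contain underscores.

```python
SAMPLE = ["AAA", "AABA", "COL1", "COL21", "AAAAAA", "BBAA", "BAAAA", "AB"]


def trim_abbreviation(name, names):
    """Shorten name one character at a time, stopping before a conflict.

    Returns (abbr, lev), where lev counts the steps like the CTE's counter.
    """
    abbr, lev = name, 1
    while len(abbr) > 1:
        shorter = abbr[:-1]
        # Same test as the NOT EXISTS: does another column share this prefix?
        if any(o != name and o.startswith(shorter) for o in names):
            break
        abbr, lev = shorter, lev + 1
    return abbr, lev


for col in sorted(SAMPLE):
    print(col, trim_abbreviation(col, SAMPLE)[0])
```

Because the loop stops at the first conflict, the last value reached is the answer, which is what the row_number() filter picks out in the query.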
This is actually possible using a recursive CTE. I can't get it shorter than three subqueries (plus the final query), but at least it is not constrained by string length. The steps are roughly as follows:
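1. Generate all potential abbreviations (every prefix) of each column name. Table:
COL     ABBR
------------
AAA     AAA
AAA     AA
AAA     A
...
2. Count how many column names share each abbreviation. Table:
ABBR    CONFLICT
----------------
AA      3
AAA     2
AABA    1
...
3. Keep only the abbreviations that are unique (conflict = 1) or equal to the column name, and rank them per column. Note that AAA conflicts with some other abbreviation but still must be chosen as it is equal to its unshortened name. Table:
COL     ABBR    CONFLICT    POS
-------------------------------
AAA     AAA     2           1
AAAAAA  AAAA    1           1
AAAAAA  AAAAA   1           2
AAAAAA  AAAAAA  1           3
AABA    AAB     1           1
...
4. Pick the first-ranked abbreviation for each column. Table:
COL     ABBR    POS
-------------------
AAA     AAA     1
AAAAAA  AAAA    1
AABA    AAB     1
...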
This results in the following SQL, with the above steps as CTEs:
with potential_abbreviations(col,abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, abbreviation_counts as (
  select abbr
       , count(*) as conflict
  from potential_abbreviations
  group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
    p.col
  , p.abbr
  , conflict
  , rank() over (partition by col order by p.abbr) as pos
  from potential_abbreviations p
    join abbreviation_counts c on p.abbr = c.abbr
    where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
 order by col, abbr
Result (with two extra test rows, AC1 and AD, added to the sample data):
COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2
SQL Fiddle
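As a sanity check on the counting approach, here is a rough Python sketch that follows the same stages as the CTEs. The names abbreviate_by_counts and NAMES are made up for illustration, and NAMES assumes the test data was the question's sample plus the AC1 and AD rows that show up in the result.

```python
from collections import Counter

NAMES = ["AAA", "AABA", "COL1", "COL21", "AAAAAA",
         "BBAA", "BAAAA", "AB", "AC1", "AD"]


def abbreviate_by_counts(names):
    # potential_abbreviations: every prefix of every name
    potential = [(n, n[:k]) for n in names for k in range(1, len(n) + 1)]
    # abbreviation_counts: how many names share each prefix
    conflict = Counter(abbr for _, abbr in potential)
    # all_unique_abbreviations: keep unique prefixes or the full name itself
    candidates = {}
    for col, abbr in potential:
        if conflict[abbr] == 1 or abbr == col:
            candidates.setdefault(col, []).append(abbr)
    # pos = 1: among prefixes of one string, the smallest in sort order
    # is also the shortest
    return {col: min(abbrs) for col, abbrs in candidates.items()}


for col, abbr in sorted(abbreviate_by_counts(NAMES).items()):
    print(col, abbr)
```

The last step relies on the same property as ordering by abbr in the rank(): a prefix always sorts before any longer string that starts with it.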