I have the following table
id error
-  ----------------------------------------
1  Error 1234eee5, can not write to disk
2  Error 83457qwe, can not write to disk
3  Error 72344ee, can not write to disk
4  Fatal barier breach on object 72fgsff
5  Fatal barier breach on object 7fasdfa
6  Fatal barier breach on object 73456xcc5
I want to be able to get a result that counts by similarity, where similarity of > 80% means two errors are equal. I've been using pg_trgm extension, and its similarity function works perfectly for me, the only thing I can figure out how to produce the grouping result below.
Error                                  Count
-------------------------------------  ------
Error 1234eee5, can not write to disk, 3
Fatal barier breach on object 72fgsff, 3
Basically you could join a table with itself to find similar strings, however this approach will end in a terribly slow query on a larger dataset. Also, using similarity() may cause inaccuracy in some cases (you need to find the appropriate limit value).
You should try to find patterns. For example, if all variable words in strings begin with a digit, you can mask them using regexp_replace():
select id, regexp_replace(error, '\d\w+', 'xxxxx') as error
from errors;
 id |                error                
----+-------------------------------------
  1 | Error xxxxx, can not write to disk
  2 | Error xxxxx, can not write to disk
  3 | Error xxxxx, can not write to disk
  4 | Fatal barier breach on object xxxxx
  5 | Fatal barier breach on object xxxxx
  6 | Fatal barier breach on object xxxxx
(6 rows)    
so you can easily group the data by error message:
select regexp_replace(error, '\d\w+', 'xxxxx') as error, count(*)
from errors
group by 1;
                error                | count 
-------------------------------------+-------
 Error xxxxx, can not write to disk  |     3
 Fatal barier breach on object xxxxx |     3
(2 rows)
The above query is only an example as the specific solution depends on the data format.
The solution based on the OP's idea (see the comments below). The limit 0.8 for similarity() is certainly too high. It seems that it should be somewhere about 0.6.
The table for unique errors (I've used a temporary table but it also be a regular one of course):
create temp table if not exists unique_errors(
    id serial primary key, 
    error text, 
    ids int[]);
The ids column is to store id of rows of the base table which contain similar errors.
do $$
declare
    e record;
    found_id int;
begin
    truncate unique_errors;
    for e in select * from errors loop
        select min(id)
        into found_id
        from unique_errors u
        where similarity(u.error, e.error) > 0.6;
        if found_id is not null then
            update unique_errors
            set ids = ids || e.id
            where id = found_id;
        else
            insert into unique_errors (error, ids)
            values (e.error, array[e.id]);
        end if;
    end loop;
end $$;
The final results:
select *, cardinality(ids) as count
from unique_errors;
 id |                 error                 |   ids   | count 
----+---------------------------------------+---------+-------
  1 | Error 1234eee5, can not write to disk | {1,2,3} |     3
  2 | Fatal barier breach on object 72fgsff | {4,5,6} |     3
(2 rows)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With