
Split a file based on number of groups in first column in bash and maximum line number

Tags: bash, sed, awk

Consider the following (sorted) file test.txt where in the first column a occurs 3 times, b occurs once, c occurs 2 times and d occurs 4 times.

a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1

I would like to split this file into smaller files of at most 4 lines each. However, I need to retain the groups in the smaller files, meaning that all lines that start with the same value in column $1 need to be in the same file. In this example, a group is never larger than the desired output length.

The expected output would be:

file1:

a 1
a 2
a 1
b 1

file2:

c 1
c 1

file3:

d 2
d 1
d 2
d 1

From the expected output, you can see that if two or more consecutive groups together do not exceed the maximum number of lines (here 4), they should go into the same file.

For example, a and b together have 4 entries, so they can go into the same file. However, c and d together have 6 entries, so c has to go into its own file.
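
Since Python is also acceptable here, the packing rule described above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for real data: the function name pack_groups is made up for illustration, and it assumes the input is already sorted on column 1, contains no blank lines, and that no single group exceeds the limit.

```python
from itertools import groupby


def pack_groups(lines, max_lines):
    """Greedily pack consecutive groups (keyed on column 1) into chunks.

    Assumes the input is sorted on column 1 and has no blank lines.
    A group larger than max_lines would simply get a chunk of its own.
    """
    chunks = [[]]
    for _, group in groupby(lines, key=lambda line: line.split()[0]):
        group = list(group)
        # Start a new chunk when this group would overflow the current one.
        if chunks[-1] and len(chunks[-1]) + len(group) > max_lines:
            chunks.append([])
        chunks[-1].extend(group)
    return [chunk for chunk in chunks if chunk]


sample = ["a 1", "a 2", "a 1", "b 1", "c 1", "c 1",
          "d 2", "d 1", "d 2", "d 1"]

for number, chunk in enumerate(pack_groups(sample, 4), start=1):
    print(f"file{number}:", chunk)
```

With the sample data this yields three chunks (a and b together, c alone, d alone). Each chunk could then be written to its own file.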

I am aware of this Awk oneliner:

awk '{print>$1".test"}' test.txt

But this results in a separate file for each group. That would not make much sense for the real-world problem I am facing, since a lot of files would have to be transferred to the HPC and back, and the overhead would be too high.

A bash solution would be preferred. But it could also be Python.

MKR asked Nov 29 '25 06:11


2 Answers

Another awk. Had a busy day and this is only tested with your sample data, so anything could happen. It creates files named fileN.txt, where N starts at 1; the maximum number of lines per file is passed in as n:

$ awk -v n=4 '
BEGIN {
    fc=1                                         # file numbering initialized
}
{
    if($1==p||FNR==1)                            # when $1 remains same
        b=b (++cc==1?"":ORS) $0                  # keep buffering
    else {
        if(n-(cc+cp)>=0) {                       # if room in previous file
            print b >> sprintf("file%d.txt",fc)  # append to it
            cp+=cc                               
        } else {                                 # if it does not fit
            close(sprintf("file%d.txt",fc))
            print b > sprintf("file%d.txt",++fc) # create new
            cp=cc
        }
        b=$0
        cc=1
    }
    p=$1
}
END {                                            # same as the else above
    if(n-(cc+cp)>=0)
        print b >> sprintf("file%d.txt",fc)
    else {
        close(sprintf("file%d.txt",fc))
        print b > sprintf("file%d.txt",++fc)
    }
}' file
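
Because this was only tested with the sample data, it may be worth checking the result on real input. The following is a small sketch of such a check in Python; the function name check_split and the message wording are made up for illustration. It takes the contents of the output files as lists of lines and reports any file that exceeds the limit and any group that ends up in more than one file.

```python
def check_split(chunks, max_lines):
    """Return a list of problems found; an empty list means the split looks valid.

    chunks is a list of lists of lines, one inner list per output file.
    """
    problems = []
    seen = {}  # group key -> number of the chunk where it first appeared
    for index, chunk in enumerate(chunks, start=1):
        if len(chunk) > max_lines:
            problems.append(
                f"chunk {index} has {len(chunk)} lines (limit {max_lines})"
            )
        # Look at each distinct key in this chunk once, in a stable order.
        for key in sorted({line.split()[0] for line in chunk}):
            if key in seen:
                problems.append(
                    f"group {key!r} spans chunks {seen[key]} and {index}"
                )
            else:
                seen[key] = index
    return problems


good = [["a 1", "a 2", "a 1", "b 1"], ["c 1", "c 1"],
        ["d 2", "d 1", "d 2", "d 1"]]
bad = [["a 1", "a 2"], ["a 1", "b 1"]]

print(check_split(good, 4))
print(check_split(bad, 4))
```

The lists could be filled from the generated files, for example with open(name).read().splitlines() for each file name.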

James Brown answered Dec 02 '25 01:12


I hope I have understood your requirement correctly. Could you please try the following? It was written and tested with GNU awk.

awk -v count="1" '
FNR==NR{
  max[$1]++
  if(!a[$1]++){
    first[++count2]=$1
  }
  next
}
FNR==1{
  for(i in max){
    maxtill=(max[i]>maxtill?max[i]:maxtill)
  }
  prev=$1
}
{
  if(!b[$1]++){++count1};
  c[$1]++
  if(prev!=$1 && prev){
    if((maxtill-currentFill)<max[$1]){count++}
    else if(maxtill==max[$1])        {count++}
  }
  else if(prev==$1 && c[$1]==maxtill && count1<count2){
    count++
  }
  else if(c[$1]==maxtill && prev==$1){
    if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
  }
  prev=$1
  outputFile="outfile"count
  print > (outputFile)
  currentFill=currentFill==maxtill?1:++currentFill
}
'  Input_file  Input_file


Testing of above solution with OP's sample Input_file:

cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1

It will create 3 output files named outfile1, outfile2 and outfile3 as follows.

cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1


Second test (with my own sample): let's say the following is Input_file.

cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5

When I run the above solution, 2 output files are created, named outfile1 and outfile2, as follows.

cat outfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5
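
A note on why the second test gives only 2 files: as far as I can read the script, it takes no separate limit parameter, and the per-file limit is maxtill, the size of the largest group in the input. The short Python sketch below (the function name largest_group_size is made up for illustration) computes that value for both samples. In the second sample the largest group has 6 lines, so a, b and c (6 lines together) fit into one file.

```python
from collections import Counter


def largest_group_size(lines):
    """Size of the biggest group, keyed on the first whitespace-separated column."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return max(counts.values(), default=0)


first_sample = ["a 1", "a 2", "a 1", "b 1", "c 1", "c 1",
                "d 2", "d 1", "d 2", "d 1"]
second_sample = first_sample + ["d 4", "d 5"]

print(largest_group_size(first_sample))   # 4
print(largest_group_size(second_sample))  # 6
```

If a fixed limit such as 4 is required regardless of the largest group, the limit would have to be passed in explicitly instead of being derived from the data.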

RavinderSingh13 answered Dec 02 '25 01:12