Consider the following (sorted) file test.txt, where in the first column a occurs 3 times, b occurs once, c occurs 2 times, and d occurs 4 times.
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
I would like to split this file into smaller files with a maximum of 4 lines each. However, I need to retain the groups in the smaller files, meaning that all lines that start with the same value in column $1 need to be in the same file. In this example, no single group is larger than the desired maximum file length.
The expected output would be:
file1:
a 1
a 2
a 1
b 1
file2:
c 1
c 1
file3:
d 2
d 1
d 2
d 1
From the expected output, you can see that if two or more consecutive groups together have no more than the maximum number of lines (here 4), they should go into the same file.
Therefore: a + b together have 4 entries and can go into the same file. However, c + d together have 6 entries, so c has to go into its own file.
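In other words: buffer each group, and start a new output file only when the buffered group no longer fits into the current one. A minimal awk sketch of that rule (assuming the input is sorted on column 1, as above; the part*.txt names and the max variable are placeholders, not part of the requirement):
awk -v max=4 '
# sketch: greedy packing of already-sorted groups into part1.txt, part2.txt, ...
function emit(   f) {                      # write the buffered group to an output file
    if (fn == 0 || filled + cnt > max) {   # group does not fit -> start a new file
        fn++
        filled = 0
    }
    f = "part" fn ".txt"
    printf "%s", buf > f
    filled += cnt
    buf = ""; cnt = 0
}
$1 != prev && NR > 1 { emit() }            # group boundary: decide where the finished group goes
{ buf = buf $0 ORS; cnt++; prev = $1 }     # keep buffering the current group
END { if (cnt) emit() }                    # do not forget the last group
' test.txt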
I am aware of this awk one-liner:
awk '{print>$1".test"}' test.txt
But this results in a separate file for each group. That would not make much sense for the real-world problem I am facing, since it would lead to a lot of files being transferred to the HPC and back, making the overhead too large.
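With the sample above, for example, it would leave one file per group, something like:
$ awk '{print>$1".test"}' test.txt
$ ls *.test
a.test  b.test  c.test  d.test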
A bash solution would be preferred, but Python would also work.
Another awk. This is only tested with your sample data, so anything could happen. It creates files named filen.txt, where n > 0:
$ awk -v n=4 '
BEGIN {
    fc=1                                         # initialize file numbering
}
{
    if($1==p||FNR==1)                            # while $1 stays the same (or on the very first record)
        b=b (++cc==1?"":ORS) $0                  # keep buffering the group
    else {                                       # a new group starts
        if(n-(cc+cp)>=0) {                       # if there is room in the current file
            print b >> sprintf("file%d.txt",fc)  # append the buffered group to it
            cp+=cc
        } else {                                 # if it simply does not fit
            close(sprintf("file%d.txt",fc))
            print b > sprintf("file%d.txt",++fc) # create a new file
            cp=cc
        }
        b=$0
        cc=1
    }
    p=$1
}
END {                                            # same logic as the else branch above
    if(n-(cc+cp)>=0)
        print b >> sprintf("file%d.txt",fc)
    else {
        close(sprintf("file%d.txt",fc))
        print b > sprintf("file%d.txt",++fc)
    }
}' file
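With the sample data from the question saved as file, this should split it into three files matching the expected output:
$ cat file1.txt
a 1
a 2
a 1
b 1
$ cat file2.txt
c 1
c 1
$ cat file3.txt
d 2
d 1
d 2
d 1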
I hope I have understood your requirement correctly; could you please try the following, written and tested with GNU awk. Note that Input_file is passed twice on the command line, because the script reads the file in two passes.
awk -v count="1" '
FNR==NR{                                      # first pass: collect group statistics
  max[$1]++                                   # number of lines in each group
  if(!a[$1]++){
    first[++count2]=$1                        # remember the order in which groups first appear
  }
  next
}
FNR==1{                                       # start of the second pass
  for(i in max){
    maxtill=(max[i]>maxtill?max[i]:maxtill)   # maxtill = size of the largest group, used as lines per output file
  }
  prev=$1
}
{
  if(!b[$1]++){++count1};                     # count1 = distinct groups seen so far
  c[$1]++                                     # lines of the current group seen so far
  if(prev!=$1 && prev){                       # a new group starts: decide whether it still fits
    if((maxtill-currentFill)<max[$1]){count++}
    else if(maxtill==max[$1]) {count++}
  }
  else if(prev==$1 && c[$1]==maxtill && count1<count2){
    count++
  }
  else if(c[$1]==maxtill && prev==$1){
    if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
  }
  prev=$1
  outputFile="outfile"count                   # count = index of the current output file
  print > (outputFile)
  currentFill=currentFill==maxtill?1:++currentFill  # lines written to the current output file
}
' Input_file Input_file
Testing the above solution with OP's sample Input_file:
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
It will create 3 output files, named outfile1, outfile2 and outfile3, as follows.
cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1
Second test (with my own custom sample): let's say the following is the Input_file.
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5
When I run the above solution, 2 output files will be created, named outfile1 and outfile2, as follows.
cat outfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5