Consider the following (sorted) file test.txt, where in the first column a occurs 3 times, b occurs once, c occurs 2 times, and d occurs 4 times.
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
I would like to split this file into smaller files with a maximum of 4 lines each. However, I need to retain the groups in the smaller files, meaning that all lines that start with the same value in column $1 need to be in the same file. In this example, no single group is larger than the desired maximum file length.
The expected output would be:
file1:
a 1
a 2
a 1
b 1
file2:
c 1
c 1
file3:
d 2
d 1
d 2
d 1
From the expected output, you can see that if two or more consecutive groups together have no more than the maximum number of lines (here 4), they should go into the same file.
Therefore: a + b together have 4 entries and can go into the same file. However, c + d together have 6 entries, so c has to go into its own file.
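In other words: buffer each group, and start a new output file only when the buffered group no longer fits into the current one. A minimal awk sketch of that rule (assuming the input is sorted on column 1, as above; the part*.txt names and the max variable are placeholders, not part of the requirement):
awk -v max=4 '
# sketch: greedy packing of already-sorted groups into part1.txt, part2.txt, ...
function emit(   f) {                      # write the buffered group to an output file
    if (fn == 0 || filled + cnt > max) {   # group does not fit -> start a new file
        fn++
        filled = 0
    }
    f = "part" fn ".txt"
    printf "%s", buf > f
    filled += cnt
    buf = ""; cnt = 0
}
$1 != prev && NR > 1 { emit() }            # group boundary: decide where the finished group goes
{ buf = buf $0 ORS; cnt++; prev = $1 }     # keep buffering the current group
END { if (cnt) emit() }                    # do not forget the last group
' test.txt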
I am aware of this awk one-liner:
awk '{print>$1".test"}' test.txt
But this results in a separate file for each group. That would not make much sense for the real-world problem I am facing, since it would lead to a lot of files being transferred to the HPC and back, making the overhead too large.
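With the sample above, for example, it would leave one file per group, something like:
$ awk '{print>$1".test"}' test.txt
$ ls *.test
a.test  b.test  c.test  d.test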
A bash solution would be preferred, but Python would also work.
Another awk. This is only tested with your sample data, so anything could happen. It creates files named filen.txt, where n > 0:
$ awk -v n=4 '
BEGIN {
    fc=1                                         # initialize file numbering
}
{
    if($1==p||FNR==1)                            # while $1 stays the same (or on the very first record)
        b=b (++cc==1?"":ORS) $0                  # keep buffering the group
    else {                                       # a new group starts
        if(n-(cc+cp)>=0) {                       # if there is room in the current file
            print b >> sprintf("file%d.txt",fc)  # append the buffered group to it
            cp+=cc
        } else {                                 # if it simply does not fit
            close(sprintf("file%d.txt",fc))
            print b > sprintf("file%d.txt",++fc) # create a new file
            cp=cc
        }
        b=$0
        cc=1
    }
    p=$1
}
END {                                            # same logic as the else branch above
    if(n-(cc+cp)>=0)
        print b >> sprintf("file%d.txt",fc)
    else {
        close(sprintf("file%d.txt",fc))
        print b > sprintf("file%d.txt",++fc)
    }
}' file
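With the sample data from the question saved as file, this should split it into three files matching the expected output:
$ cat file1.txt
a 1
a 2
a 1
b 1
$ cat file2.txt
c 1
c 1
$ cat file3.txt
d 2
d 1
d 2
d 1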
I hope I have understood your requirement correctly; could you please try the following, written and tested with GNU awk. Note that Input_file is passed twice on the command line, because the script reads the file in two passes.
awk -v count="1" '
FNR==NR{                                      # first pass: collect group statistics
  max[$1]++                                   # number of lines in each group
  if(!a[$1]++){
    first[++count2]=$1                        # remember the order in which groups first appear
  }
  next
}
FNR==1{                                       # start of the second pass
  for(i in max){
    maxtill=(max[i]>maxtill?max[i]:maxtill)   # maxtill = size of the largest group, used as lines per output file
  }
  prev=$1
}
{
  if(!b[$1]++){++count1};                     # count1 = distinct groups seen so far
  c[$1]++                                     # lines of the current group seen so far
  if(prev!=$1 && prev){                       # a new group starts: decide whether it still fits
    if((maxtill-currentFill)<max[$1]){count++}
    else if(maxtill==max[$1]) {count++}
  }
  else if(prev==$1 && c[$1]==maxtill && count1<count2){
    count++
  }
  else if(c[$1]==maxtill && prev==$1){
    if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
  }
  prev=$1
  outputFile="outfile"count                   # count = index of the current output file
  print > (outputFile)
  currentFill=currentFill==maxtill?1:++currentFill  # lines written to the current output file
}
' Input_file Input_file
Testing the above solution with OP's sample Input_file:
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
It will create 3 output files, named outfile1, outfile2 and outfile3, as follows.
cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1
Second test (with my own custom sample): let's say the following is the Input_file.
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5
When I run the above solution, 2 output files will be created, named outfile1 and outfile2, as follows.
cat outfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5