CSV iteration in Ruby, and grouping by column value to get last line of each group

Tags: loops, ruby, csv

I have a CSV of transaction data, with columns like:

ID,Name,Transaction Value,Running Total,  
5,mike,5,5,  
5,mike,2,7,  
20,bob,1,1,  
20,bob,15,16,  
1,jane,4,4,  
etc...

I need to loop through every line and do something with the transaction value, and do something different when I get to the last line of each ID.

I currently do something like this:

total = ""
id = ""
idHold = ""
totalHold = ""

CSV.foreach(csvFile) do |row|
    
    totalHold = total
    idHold = id

    id = row[0]
    value = row[2]
    total = row[3]

    if id != idHold
       # do stuff with the totalHold here
    end
end

But this has a problem - it skips the last line. Also, something about it doesn't feel right. I feel like there should be a better way of detecting the last line of an 'ID'.

Is there a way of grouping the IDs and then detecting the last item in each ID group?

Note: all rows with the same ID are grouped together in the CSV.

I.M. asked Feb 02 '26 11:02


2 Answers

Yes, Ruby has got your back.

require 'csv'

grouped = CSV.table('./test.csv').group_by { |r| r[:id] }

# Then process the rows of each group individually:
grouped.each do |id, rows|
  puts [id, rows.length]
end

Tip: CSV.table lets you access each field of a row by its header, as with a hash (header names are converted to symbols):

CSV.table('./test.csv').first[:name]
=> "mike"
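
To connect the grouping back to the question, here is a minimal sketch of how the last row of each group might be treated differently from the others. The sample data, the temp file, and the names final_totals and transaction_sum are illustrative assumptions, not part of the answer above.

```ruby
require 'csv'
require 'tempfile'

# Illustrative sample data modelled on the question (trailing commas omitted).
sample = <<~ROWS
  ID,Name,Transaction Value,Running Total
  5,mike,5,5
  5,mike,2,7
  20,bob,1,1
  20,bob,15,16
  1,jane,4,4
ROWS

file = Tempfile.new(['transactions', '.csv'])
file.write(sample)
file.close

# CSV.table turns headers into symbols (:id, :running_total, ...) and
# converts numeric fields.
grouped = CSV.table(file.path).group_by { |row| row[:id] }

final_totals = {}
transaction_sum = 0
grouped.each do |id, rows|
  # Every row: do something with the transaction value.
  rows.each { |row| transaction_sum += row[:transaction_value] }
  # Last row of the group only: rows.last is the final line for this ID.
  final_totals[id] = rows.last[:running_total]
end

file.unlink
```

Since group_by loads the whole table before any group is processed, this approach is probably best kept for files that fit comfortably in memory.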

Shiyason answered Feb 04 '26 00:02


Let's first construct a CSV file.

str =<<~END
ID,Name,Transaction Value,Running Total  
5,mike,5,5  
5,mike,2,7  
20,bob,1,1  
20,bob,15,16  
1,jane,4,4
END
CSVFile = 't.csv'
File.write(CSVFile, str)
  #=> 107

I will first create a method that takes two arguments: an instance of CSV::Row and a boolean that indicates whether that row is the last of its group (true if it is).

def process_row(row, is_last)
  puts "Do something with row #{row}"
  puts "last row: #{is_last}"
end 

This method would of course be modified to perform whatever operations need be performed for each row.

Below are three ways to process the file. All three use the method CSV::foreach to read the file line by line. It is called with the file name and the options headers: true and converters: :numeric, which indicate that the first line of the file is a header row and that strings representing numbers are to be converted to the appropriate numeric objects. Here the values for "ID", "Transaction Value" and "Running Total" will be converted to integers.

Though it is not mentioned in the doc, when foreach is called without a block it returns an enumerator (in the same way that IO::foreach does).

We of course need:

require 'csv'
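
As a quick check of the point about foreach returning an enumerator when no block is given, the following sketch may help. The two-line sample file and the variable names are assumptions made for illustration only.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical two-row file, just large enough to inspect the enumerator.
file = Tempfile.new(['enum_demo', '.csv'])
file.write("ID,Name\n5,mike\n20,bob\n")
file.close

# No block is given, so foreach returns an Enumerator instead of yielding rows.
enum = CSV.foreach(file.path, headers: true, converters: :numeric)

first_row = enum.next        # CSV::Row for the first data line
first_id  = first_row['ID']  # converted to an Integer by the :numeric converter
```

Having an enumerator in hand is what allows foreach to be chained with Enumerable methods such as chunk and slice_when, or stepped through with next and peek.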

Chain foreach to Enumerable#chunk

I have chosen to use chunk, as opposed to Enumerable#group_by, because the lines of the file are already grouped by ID.

CSV.foreach(CSVFile, headers:true, converters: :numeric).
    chunk { |row| row['ID'] }.
    each do |_,(*arr, last_row)|
      arr.each { |row| process_row(row, false) }
      process_row(last_row, true)
    end

displays

Do something with row 5,mike,5,5  
last row: false
Do something with row 5,mike,2,7  
last row: true
Do something with row 20,bob,1,1  
last row: false
Do something with row 20,bob,15,16  
last row: true
Do something with row 1,jane,4,4
last row: true

Note that

enum = CSV.foreach(CSVFile, headers:true, converters: :numeric).
           chunk { |row| row['ID'] }.
           each
  #=> #<Enumerator: #<Enumerator::Generator:0x00007ffd1a831070>:each>

Each element generated by this enumerator is passed to the block and the block variables are assigned values by a process called array decomposition:

_,(*arr,last_row) = enum.next
 #=> [5, [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total  ":5>,
 #        #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total  ":7>]] 

resulting in the following:

_ #=> 5
arr
  #=> [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total  ":5>] 
last_row
  #=> #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total  ":7>

See Enumerator#next.
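
The decomposition described above can be tried on plain arrays. This is only a sketch: symbols stand in for CSV::Row objects, and all variable names are made up for the example.

```ruby
# A [key, rows] pair shaped like the elements that chunk yields
# (symbols stand in for CSV::Row objects).
pair = [5, [:row_a, :row_b, :row_c]]

# The parentheses decompose the inner array; the splat collects all but the last.
key, (*leading, final) = pair

# With a single-element group the splat variable is simply an empty array.
solo_key, (*solo_leading, solo_final) = [1, [:only_row]]
```

The single-element case is why a group with one row (such as jane's) still has its only row processed as the last row.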

I have followed the convention of using an underscore for block variables that are not used in the block calculation (to alert readers of your code). Note that an underscore is a valid block variable (see footnote 1).

Use Enumerable#slice_when in place of chunk

CSV.foreach(CSVFile, headers:true, converters: :numeric).
    slice_when { |row1,row2| row1['ID'] != row2['ID'] }.
    each do |*arr, last_row|
      arr.each { |row| process_row(row, false) }
      process_row(last_row, true)
    end

This displays the same information that is produced when chunk is used.
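
To see why the block variables differ between the two versions, it may help to compare what chunk and slice_when yield for the same input. The sketch below uses a plain array of IDs as a stand-in for the CSV rows; the names are illustrative.

```ruby
# IDs in file order, standing in for the rows of the CSV file.
ids = [5, 5, 20, 20, 1]

# chunk yields [key, group] pairs, hence the |_, (*arr, last_row)| pattern.
chunked = ids.chunk { |id| id }.to_a

# slice_when yields bare groups, hence the simpler |*arr, last_row| pattern.
sliced = ids.slice_when { |a, b| a != b }.to_a
```

Both methods only compare adjacent elements, so they rely on rows with the same ID being contiguous in the file, as the question states.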

Use Kernel#loop to step through the enumerator CSV.foreach(CSVFile, headers:true)

enum = CSV.foreach(CSVFile, headers:true, converters: :numeric)
row = nil
loop do
  row = enum.next
  next_row = enum.peek 
  process_row(row, row['ID'] != next_row['ID'])
end
process_row(row, true)

This displays the same information that is produced when chunk is used. See Enumerator#next and Enumerator#peek.

After enum.next returns the last CSV::Row object, enum.peek raises a StopIteration exception. As explained in its doc, loop handles that exception by breaking out of the loop. row must be initialized to an arbitrary value before entering the loop so that row is visible after the loop terminates. At that point row holds the CSV::Row object for the last line of the file.
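
The interplay of next, peek and loop can be observed on a plain array enumerator. This is a sketch under the assumption that integers stand in for rows; the names seen, current and upcoming are invented for the example, and the nil guard for empty input is an addition not present in the code above.

```ruby
# Integers stand in for CSV rows; each value plays the role of a row's ID.
enum = [5, 5, 20].each

seen = []
current = nil
loop do
  current  = enum.next
  upcoming = enum.peek  # raises StopIteration once the final element is consumed
  seen << [current, current != upcoming]
end

# loop rescued StopIteration; current still holds the final element.
# The nil check skips this step when the input was empty.
seen << [current, true] unless current.nil?
```

Each pair in seen records an element and whether it was the last of its run, mirroring the arguments passed to process_row.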

1 IRB uses the underscore for its own purposes (it holds the value of the last evaluated expression), so the variable _ may end up with an unexpected value when the code above is run in IRB.

Cary Swoveland answered Feb 04 '26 00:02


