Using a DataFrame in Julia, I want to select rows based on the value in a given column.
With the following example:
using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))
I want to extract the rows where ID takes the value 1 or 4. For the moment, I have come up with this solution:
@where(DT, findall(x -> (x==4 || x==1), DT.ID))
With only two values, this is manageable.
However, I want to apply it to a case with many rows and a large set of ID values to select. This solution becomes impractical if I have to write out every value to be selected.
Is there a neater way to make this selection generic?
Damien
We can filter rows using filter(source => f::Function, df). Note how similar this is to filter(f::Function, V::Vector) from Julia's Base module.
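As a minimal sketch of that filter(source => f, df) form applied to the table from the question: the fixed x1 values and the names wanted and selected are only for illustration, and it assumes a DataFrames.jl version that supports the column => function syntax.

```julia
using DataFrames

# Same shape as the example in the question, with fixed x1 values so the result is reproducible
DT = DataFrame(ID = [1, 1, 2, 2, 3, 3, 4, 4],
               x1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# in(wanted) builds a one-argument predicate, so the list of IDs lives in a single variable
wanted = [1, 4]
selected = filter(:ID => in(wanted), DT)
```

Because the predicate only receives the ID column, the list of values can grow without the call itself changing.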
Additionally, in your example, you should use select! to modify the column names in place, or alternatively do `df = select(df, "col1" => "Id", "col2" => "Name")`, as select always returns a new DataFrame.
Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:
julia> DT
8×2 DataFrame
│ Row │ ID    │ x1        │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.433397  │
│ 2   │ 1     │ 0.963775  │
│ 3   │ 2     │ 0.365919  │
│ 4   │ 2     │ 0.325169  │
│ 5   │ 3     │ 0.0495252 │
│ 6   │ 3     │ 0.637568  │
│ 7   │ 4     │ 0.391051  │
│ 8   │ 4     │ 0.436209  │
julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │
julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │
In non-performance-critical code you can also use filter, which is, at least for me, a bit simpler to digest (its drawback is that it is slower than the methods discussed above):
julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │
Note that in the approach you mention in your question you could omit DT in front of ID like this:
julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │
(this is the beauty of DataFramesMeta.jl: it knows which DataFrame you are referring to)
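For the case raised in the question, where the set of IDs to keep is large, the same patterns work with the values stored in a collection. Here is a rough sketch, assuming DataFrames.jl 1.0 or later (for subset and ByRow); the Set named wanted and the fixed x1 values are hypothetical stand-ins for your real data. A Set is used because membership tests on it do not scan the whole list.

```julia
using DataFrames

# Fixed x1 values instead of rand(8) so the example is reproducible
DT = DataFrame(ID = [1, 1, 2, 2, 3, 3, 4, 4],
               x1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

# Hypothetical set of IDs to keep; in practice it could be read from a file or computed
wanted = Set([1, 4])

# Boolean-mask indexing, as in the DT[in([1,4]).(DT.ID), :] example above
by_index = DT[in(wanted).(DT.ID), :]

# filter with a column => predicate pair
by_filter = filter(:ID => in(wanted), DT)

# subset with a row-wise predicate
by_subset = subset(DT, :ID => ByRow(in(wanted)))
```

All three calls return a new DataFrame with the same four rows, so the choice is mostly a matter of taste. If I recall correctly, recent DataFramesMeta.jl releases replaced @where with @subset, so the macro examples above may need that rename on a current installation.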