I have data in which the combination of two variables ("ManufactererId" and "ProductId") constitute unique keys / identifiers. The data looks like this:
my.data <- data.frame(ManufactererId = c(1, 1, 2, 2),
                      ProductId = c(1, 2, 1, 7),
                      Price = c(12.99, 149.00, 0.99, 3.99))
my.data
#   ManufactererId ProductId  Price
# 1              1         1  12.99
# 2              1         2 149.00
# 3              2         1   0.99
# 4              2         7   3.99
I want to ensure that I cannot accidentally add another row with a pair of ManufactererId - ProductId equal to what is already present in the table (like the unique constraint on a database table).
That is, if I try to add a row with ManufactererId = 2 and ProductId = 7 to my data frame:
my.data <- rbind(my.data, data.frame(ManufactererId = 2, ProductId = 7, Price = 120.00))
...it should fail with an error. How can this be achieved?
Or should I use a different data type?
By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.
Pandas drop_duplicates() Function Syntax keep: allowed values are {'first', 'last', False}, default 'first'. If 'first', duplicate rows except the first one is deleted. If 'last', duplicate rows except the last one is deleted. If False, all the duplicate rows are deleted.
To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.
1) zoo Whether this is convenient or not depends on what operations you want to do but zoo objects have unique indexes. We can construct a text index by pasting together the two Id columns.
library(zoo)
z <- with(my.data, zoo(Price, paste(ManufactererId, ProductId)))
z <- c(z, zoo(90, "1 1")) # Error, not appended
z <- c(z, zoo(90, "1 3")) # OK
Note that the data part of a zoo object can be a vector as shown above or a matrix in case you have more than just Price in the data.
2) SQLite This could be done with any of a number of databases but we will use SQLite here. First we create a table with a unique index in an SQLite database and then we insert rows.
library(RSQLite)
con <- dbConnect(SQLite())
dbWriteTable(con, "my", my.data, row.names = FALSE)
dbGetQuery(con, "create unique index ix on my(ManufactererId, ProductId)")
dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 1, 99)) # error
dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 13, 90)) # OK
You can do something like this: Where keys is your unique-key
append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(ncol(DF) != ncol(to_be_appended) || !all(names(DF) %in% names(to_be_appended))){
    stop("must have the same columns")
  }
  if(nrow(merge(DF, to_be_appended, by=keys))==0){
    rbind(DF, to_be_appended)
  } else {
    stop("Trying to append douplicated indices")
  }
}
Test it:
to_be_appended = data.frame(ManufactererId=2,ProductId=17,Price=3.99)
append_save(my.data, to_be_appended) # works
to_be_appended_err = data.frame(ManufactererId=2,ProductId=7,Price=3.99)
append_save(my.data, to_be_appended_err) # error
If you append data only based on key-columns you could use data.table as follows:
append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(!all(keys %in% names(to_be_appended))){
    stop("key-columns must be present")
  }
  if(nrow(data.table::merge(DF, to_be_appended, on=keys))==0){
    data.table::setDF(data.table::rbindlist(list(DF, to_be_appended), fill = TRUE))[]
  } else {
    stop("Trying to append douplicated indices")
  }
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With