What's a quick, scalable way to convert the integers 1 through N to a corresponding sequence of N strings "A", "B", ... "Z", "AA", "AB", ...?
Alternatively, I'd be happy with something that maps the integer vector onto a character vector such that every element of the character vector has the same number of characters, e.g. 1, 2, ..., 27 => "AA", "AB", ..., "AZ", "BA".
Example input:
num_vec <- seq(1, 1000)
char_vec <- ???
UPDATE
My hackish, but best working attempt:
library(data.table)
myfunc <- function(n){
if(n <= 26){
dt <- CJ(LETTERS)[, Result := paste0(V1)]
} else if(n <= 26^2){
dt <- CJ(LETTERS, LETTERS)[, Result := paste0(V1, V2)]
} else if(n <= 26^3){
dt <- CJ(LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3)]
} else if(n <= 26^4){
dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4)]
} else if(n <= 26^5){
dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5)]
} else if(n <= 26^6){
dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5, V6)]
} else{
stop("n too large")
}
return(dt$Result[1:n])
}
myfunc(10)
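For what it's worth, the if/else ladder above can probably be collapsed into plain base-26 arithmetic. Here is a minimal base R sketch, assuming the fixed-width output described in the question is what's wanted (`fixed_width_labels` is a hypothetical name, not something taken from the posts here):

```r
# Hypothetical helper: fixed-width labels for 1..n, all with the same
# number of characters (e.g. n = 27 gives "AA", ..., "AZ", "BA").
fixed_width_labels <- function(n) {
  stopifnot(n >= 1)
  # k = number of letters needed, i.e. the smallest k with 26^k >= n
  k <- 1L
  while (26^k < n) k <- k + 1L
  idx <- seq_len(n) - 1L          # zero-based index of each label
  out <- character(n)
  # Append one base-26 digit per pass, most significant digit first
  for (p in (k - 1L):0L) {
    out <- paste0(out, LETTERS[(idx %/% 26^p) %% 26 + 1L])
  }
  out
}

fixed_width_labels(27)[c(1, 26, 27)]
```

The last line should give "AA" "AZ" "BA". There is no upper limit on n other than memory, since the width k is computed rather than hard-coded.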
Several nice solutions were posted in the comments already. So far, only the solution posted by @Gregor gives the output that Ben prefers.
However, the methods posted by @eddi, @DavidArenburg and @G.Grothendieck can be adapted to give the preferred output as well:
# adaptation of @eddi's method:
library(data.table)
n <- 29
sz <- ceiling(log(n)/log(26))
do.call(CJ, replicate(sz, c("", LETTERS), simplify = FALSE))[-1, unique(Reduce(paste0, .SD))][1:n]
# adaptation of @DavidArenburg's method:
n <- 29
list(LETTERS, c(LETTERS, do.call(CJ, replicate((n - 1) %/% 26 + 1, LETTERS, simplify = FALSE))[, do.call(paste0, .SD)][1:(n-26)])[[(n>26)+1]]
# adaptation of @G.Grothendieck's method:
n <- 29
sz <- ceiling(log(n)/log(26))
g <- expand.grid(c('',LETTERS), rep(LETTERS, (sz-1)))
g <- g[order(g$Var1),]
do.call(paste0, g)[1:n]
All three result in:
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "AA" "AB" "AC"
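For comparison, the same "A" ... "Z", "AA", ... sequence can be sketched in base R without data.table, by treating each number as a bijective base-26 numeral (digits run from 1 to 26, with no zero). `excel_labels` is a hypothetical helper, not one of the posted solutions, and it loops per element, so I'd expect it to be much slower than the compiled versions for large n:

```r
# Hypothetical helper: variable-width labels "A", ..., "Z", "AA", "AB", ...
excel_labels <- function(n) {
  vapply(seq_len(n), function(i) {
    s <- ""
    while (i > 0) {
      r <- (i - 1) %% 26            # zero-based position of the last letter
      s <- paste0(LETTERS[r + 1], s)
      i <- (i - 1) %/% 26           # drop the digit just emitted
    }
    s
  }, character(1))
}

excel_labels(29)[26:29]
```

The last line should give "Z" "AA" "AB" "AC", matching the tail of the output above.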
This seems like an awesome candidate for Rcpp. Below is a very simple approach:
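#include <Rcpp.h>
using namespace Rcpp;

// Paste every element of x onto every element of y (x varies slowest).
// [[Rcpp::export]]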
StringVector combVec(CharacterVector x, CharacterVector y) {
int nx = x.size();
int ny = y.size();
CharacterVector z(nx*ny);
int k = 0;
for (int i = 0; i < nx; i++) {
for (int j = 0; j < ny; j++) {
z[k] = x[i];
z[k] += y[j];
k++;
}
}
return z;
}
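NumChar <- function(n) {
  # k = number of letters needed, i.e. the smallest k with 26^k >= n
  k <- 1L
  while (26^k < n) k <- k + 1L
  ch <- LETTERS
  # seq_len() skips the loop when k == 1 (t:1L counted down from 0 for n < 26)
  for (i in seq_len(k - 1L)) {ch <- combVec(ch, LETTERS)}
  ch[1:n]
}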
The result is exactly what the OP's answer returns.
library(data.table)
Rcpp::sourceCpp('combVec.cpp')
identical(myfunc(100000), NumChar(100000))
[1] TRUE
head(NumChar(100000))
[1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE" "AAAF"
tail(NumChar(100000))
[1] "FRXY" "FRXZ" "FRYA" "FRYB" "FRYC" "FRYD"
Updated benchmarks including @eddi's excellent Rcpp implementation:
library(microbenchmark)
microbenchmark(myfunc(10000), funEddi(10000), NumChar(10000), excelCols(10000, LETTERS))
Unit: microseconds
expr min lq mean median uq max neval cld
myfunc(10000) 6632.125 7255.454 8441.7770 7912.4780 9283.660 14184.971 100 c
funEddi(10000) 12012.673 12869.928 15296.3838 13870.7050 16425.907 80443.142 100 d
NumChar(10000) 2592.555 2883.394 3326.9292 3167.4995 3574.300 6051.273 100 b
excelCols(10000, LETTERS) 636.165 656.820 782.7679 716.9225 811.148 1386.673 100 a
microbenchmark(myfunc(100000), funEddi(100000), NumChar(100000), excelCols(100000, LETTERS), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
myfunc(1e+05) 203.992591 210.049303 255.049395 220.74955 262.52141 397.03521 10 c
funEddi(1e+05) 523.934475 530.646483 563.853995 552.83903 577.88915 688.84714 10 d
NumChar(1e+05) 82.216802 83.546577 97.615537 93.63809 112.14316 115.84911 10 b
excelCols(1e+05, LETTERS) 7.480882 8.377266 9.562554 8.93254 11.10519 14.11631 10 a
As @DirkEddelbuettel says, "Rcpp is not some magic pony...". These differences in efficiency just show that although Rcpp (or any package, for that matter) is super awesome, it won't fix crappy code. Thanks @eddi for posting a proper Rcpp implementation.
Here's a fast Rcpp solution, which should be an order of magnitude or more faster than the native R solutions:
library(Rcpp)
cppFunction('CharacterVector excelCols(int n, CharacterVector x) {
CharacterVector res(n);
int sz = x.size();
std::string base;
int baseN[100] = {0}; // being lazy about size here - you will never grow larger than this
for (int i = 0; i < n; ++i) {
bool incr = false;
for (int j = base.size() - 1; j >= 0 && !incr; --j) {
if (baseN[j] == sz) {
baseN[j] = 1;
base[j] = as<std::string>(x[0])[0];
} else {
baseN[j] += 1;
base[j] = as<std::string>(x[baseN[j] - 1])[0];
incr = true;
}
}
if (!incr) {
baseN[base.size()] = 1;
base += x[0];
}
res[i] = base;
}
return res;
}')
excelCols(100, LETTERS)