I am looking for a more efficient equivalent of
df.reindex(labels).dropna(subset=[0])
that avoids including the NaN rows for missing label in the result rather than having to delete them after reindex puts them in.
Equivalently I am loooking for an efficient version of
df.loc[labels]
that silently ignores labels that are not in df.index, ie the result may have fewer rows than elements of labels.
I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.
Here is a concrete demonstration of the issue following on from @MaxU's answer:
In [2]: L = 10**7
   ...: M = 10**4
   ...: N = 10**9
   ...: np.random.seed([3, 1415])
   ...: df = pd.DataFrame(np.random.rand(L, 2))
   ...: labels = np.random.randint(N, size=M)
   ...: M-len(set(labels)) 
   ...: 
   ...: 
Out[2]: 0
In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, ie df.index does indeed index.
So the question is how do I get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers. A partial solution is the the original 
In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is still a ~100 times faster than the suggested solutions, but still losing an order of magnitude to what may be possible.
To further demonstrate that it is possible to get sublinear performance even without assuptions on the index repeat the above with a string index
In [16]: df.index=df.index.map(str)
    ...: labels = np.array(list(map(str, labels)))
    ...: 
    ...: 
In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So to be clear I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index, therefore so should the solution.
The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
Practical Data Science using Python To create a DataFrame from DateTimeIndex ignoring the index, use the DateTimeIndex. to_frame() method. Set the parameter index to False to ignore the index.
iloc[[ id ]] (with a single-element list) takes 489 ms, almost half a second, 1,800x times slower than the identical .
Dealing with index and axisIf you want the concatenation to ignore existing indices, you can set the argument ignore_index=True . Then, the resulting DataFrame index will be labeled with 0 , …, n-1 .
You can use df. head() to get the first N rows in Pandas DataFrame. Alternatively, you can specify a negative number within the brackets to get all the rows, excluding the last N rows.
Here is a small comparison for different approaches.
Sample DF (shape: 10.000.000 x 2):
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)
In [88]: df.shape
Out[88]: (10000000, 2)
valid (existing labels):
In [89]: (labels <= 10**7).sum()
Out[89]: 1008
invalid (not existing labels):
In [90]: (labels > 10**7).sum()
Out[90]: 98992
Timings:
In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With