
How to split continuous data into groups?

I have two data sets, the first one with discrete data and the second one with continuous data:

import numpy as np

# discrete
data1 = [1, 1, 2, 2, 2, 3, 4, 4, 7, 7, 7, 7, 7, 7]

# continuous
data2 = np.random.normal(size=100)

Now I want to calculate frequencies. It's straightforward for data1, since it contains discrete values:

import collections

c = collections.Counter(data1)
total = sum(c.values())  # named "total" to avoid shadowing the built-in sum
for key, val in c.items():  # items() replaces the Python 2-only iteritems()
    print([key, val / total])

How can I do the same for continuous numbers? In theory, continuous data must first be grouped into intervals; only then can it be represented as a bar chart. So how do I group the data in Python?

Klausos Klausos asked Oct 30 '25 13:10


2 Answers

With numpy, have a look at np.histogram for the continuous data and np.bincount for the discrete data (np.bincount only works on non-negative integers).

As a quick example:

import numpy as np

data1 = [1, 1, 2, 2, 2, 3, 4, 4, 7, 7, 7, 7, 7, 7]
data2 = np.random.normal(size=100)


discrete_counts = np.bincount(data1)
discrete_vals = np.arange(len(discrete_counts))

counts, edges = np.histogram(data2)
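
To get from these counts to the frequencies asked about in the question, one option is to divide by the total number of samples. The following is only a sketch: the seeded generator `rng` and the names `discrete_freqs` and `freqs` are illustrative additions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
data1 = [1, 1, 2, 2, 2, 3, 4, 4, 7, 7, 7, 7, 7, 7]
data2 = rng.normal(size=100)

# Discrete data: one count per integer value from 0 up to max(data1)
discrete_counts = np.bincount(data1)
discrete_freqs = discrete_counts / discrete_counts.sum()
for val, freq in enumerate(discrete_freqs):
    if freq > 0:  # skip integers that never occur
        print(val, freq)

# Continuous data: one count per bin (10 equal-width bins by default)
counts, edges = np.histogram(data2)
freqs = counts / counts.sum()
for left, right, freq in zip(edges[:-1], edges[1:], freqs):
    print(f"{left:.2f} to {right:.2f}: {freq:.2f}")
```

Note that `edges` always has one more entry than `counts`, because each bin needs a left and a right edge.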

If you'd like to plot the results, have a look at plt.hist and plt.bar.

For example:

import numpy as np
import matplotlib.pyplot as plt

data1 = [1, 1, 2, 2, 2, 3, 4, 4, 7, 7, 7, 7, 7, 7]
data2 = np.random.normal(size=100)

fig, axes = plt.subplots(nrows=2)

counts = np.bincount(data1)
vals = np.arange(len(counts))
axes[0].bar(vals, counts, align='center', color='lightblue')
axes[0].set(title='Discrete Data')

axes[1].hist(data2, color='salmon')
axes[1].set(title='Continuous Data')

for ax in axes:
    ax.margins(0.05)
    ax.set_ylim(bottom=0)

plt.show()


If you're using pandas, as @Carsten mentioned, look at the hist method to plot the histogram (similar to plt.hist). However, the closest pandas counterpart to numpy.histogram is pandas.cut, which assigns each value to a bin. That is extremely handy when you want the histogram counts (or want to group by a continuous range).
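
As a rough sketch of the pandas.cut approach, assuming 10 equal-width bins and a seeded sample (the names `binned`, `counts`, `freqs` and `group_means` are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
s = pd.Series(rng.normal(size=100))

# Assign every value to one of 10 equal-width intervals
binned = pd.cut(s, bins=10)

# Count the values per interval, ordered by interval rather than by count
counts = binned.value_counts().sort_index()
freqs = counts / counts.sum()
print(freqs)

# Group the original values by interval, e.g. to get the mean of each group
# (an interval with no values yields NaN)
group_means = s.groupby(binned, observed=False).mean()
print(group_means)
```

Unlike numpy.histogram, pandas.cut uses right-closed intervals by default, so the per-bin counts can differ slightly for values that fall exactly on an edge.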

Joe Kington answered Nov 02 '25 03:11


What you're looking for is called a histogram. You can use numpy.histogram to get one of those from your array. You pass a numpy array and the edges of your groups (or bins, as they are commonly called) to the function, and it will return a 2-tuple, consisting of the number of elements in each bin and the bin edges. Example from the docs:

>>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]))
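
The result above follows from how np.histogram treats the edges: every bin is half-open except the last one, which also includes its right edge. A small sketch of that convention (the variable names are only illustrative):

```python
import numpy as np

bins = [0, 1, 2, 3]

# The bins are [0, 1), [1, 2) and [2, 3]; only the last includes its right edge
counts, edges = np.histogram([1, 2, 1], bins=bins)
print(counts)  # [0 2 1]

# A value equal to the final edge is counted in the last bin
counts_edge, _ = np.histogram([0, 1, 2, 3], bins=bins)
print(counts_edge)  # [1 1 2]

# Values outside the outermost edges are ignored
counts_out, _ = np.histogram([-1, 0.5, 5], bins=bins)
print(counts_out)  # [1 0 0]
```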

@ajrc mentioned pandas in the comments. If you have a pandas Series (you can create one with s = pandas.Series(data2)), you can plot a histogram by calling s.hist(). It uses equally spaced bins over the range of your data (the default number of bins is 10, but you can adjust that with the bins parameter).

Carsten answered Nov 02 '25 04:11


