Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the year with the highest population (most efficient solution)

Given two arrays; $births containing a list of birth years indicating when someone was born, and $deaths containing a list of death years indicating when someone died, how can we find the year on which the population was highest?

For example given the following arrays:

$births = [1984, 1981, 1984, 1991, 1996];
$deaths = [1991, 1984];

The year on which the population was highest should be 1996, because 3 people were alive during that year, which was the highest population count of all those years.

Here's the running math on that:

| Birth | Death | Population |
|-------|-------|------------|
| 1981  |       | 1          |
| 1984  |       | 2          |
| 1984  | 1984  | 2          |
| 1991  | 1991  | 2          |
| 1996  |       | 3          |

Assumptions

We can safely assume that the year on which someone is born the population can increase by one and the year on which someone died the population can decrease by one. So in this example, 2 people were born on 1984 and 1 person died on 1984, meaning the population increased by 1 on that year.

We can also safely assume that the number of deaths will never exceed the number of births and that no death can occur when the population is at 0.

We can also safely assume that the years in both $deaths and $births will never be negative or floating point values (they're always positive integers greater than 0).

We cannot assume that the arrays will be sorted or that there won't be duplicate values, however.

Requirements

We must write a function to return the year on which the highest population occurred, given these two arrays as input. The function may return 0, false, "", or NULL (any falsey value is acceptable) if the input arrays are empty or if the population was always at 0 throughout. If the highest population occurred on multiple years the function may return the first year on which the highest population was reached or any subsequent year.

For example:

$births = [1997, 1997, 1997, 1998, 1999];
$deaths = [1998, 1999];

/* The highest population was 3 on 1997, 1998 and 1999, either answer is correct */

Additionally, including the Big O of the solution would be helpful.


My best attempt at doing this would be the following:

function highestPopulationYear(Array $births, Array $deaths): Int {

    sort($births);
    sort($deaths);

    $nextBirthYear = reset($births);
    $nextDeathYear = reset($deaths);

    $years = [];
    if ($nextBirthYear) {
        $years[] = $nextBirthYear;
    }
    if ($nextDeathYear) {
        $years[] = $nextDeathYear;
    }

    if ($years) {
        $currentYear = max(0, ...$years);
    } else {
        $currentYear = 0;
    }

    $maxYear = $maxPopulation = $currentPopulation = 0;

    while(current($births) !== false || current($deaths) !== false || $years) {

        while($currentYear === $nextBirthYear) {
            $currentPopulation++;
            $nextBirthYear = next($births);
        }

        while($currentYear === $nextDeathYear) {
            $currentPopulation--;
            $nextDeathYear = next($deaths);
        }

        if ($currentPopulation >= $maxPopulation) {
            $maxPopulation = $currentPopulation;
            $maxYear = $currentYear;
        }

        $years = [];

        if ($nextBirthYear) {
            $years[] = $nextBirthYear;
        }
        if ($nextDeathYear) {
            $years[] = $nextDeathYear;
        }
        if ($years) {
            $currentYear = min($years);
        } else {
            $currentYear = 0;
        }
    }

    return $maxYear;
}

The algorithm above should work in polynomial time given it is at worst O(((n log n) * 2) + k) where n is number of elements to be sorted from each array and k is number of birth years (since we know that k is always k >= y) where y is number of death years. However, I'm not sure if there is a more efficient solution.

My interests are purely in an improved Big O of computational complexity upon the existing algorithm. Memory complexity is of no concern. Nor is the runtime optimization. At least it's not a primary concern. Any minor/major runtime optimizations are welcome, but not the key factor here.

like image 807
Sherif Avatar asked Feb 23 '20 17:02

Sherif


4 Answers

We can solve this in linear time with bucket sort. Let's say the size of the input is n, and the range of years is m.

O(n): Find the min and max year across births and deaths.
O(m): Create an array of size max_yr - min_yr + 1, ints initialized to zero. 
      Treat the first cell of the array as min_yr, the next as min_yr+1, etc...
O(n): Parse the births array, incrementing the appropriate index of the array. 
      arr[birth_yr - min_yr] += 1
O(n): Ditto for deaths, decrementing the appropriate index of the array.
      arr[death_yr - min_yr] -= 1
O(m): Parse your array, keeping track of the cumulative sum and its max value.

The largest cumulative maximum is your answer.

The running time is O(n+m), and the additional space needed is O(m).

This is a linear solution in n if m is O(n); i.e., if the range of years isn't growing more quickly than the number of births and deaths. This is almost certainly true for real world data.

like image 174
Dave Avatar answered Nov 15 '22 06:11

Dave


I solved this problem with a memory requirement of O(n+m) [in worst case, best case O(n)]

and, time complexity of O(n logn).

Here, n & m are the length of births and deaths arrays.

I don't know PHP or javascript. I've implemented it with Java and the logic is very simple. But I believe my idea can be implemented in those languages as well.

Technique Details:

I used java TreeMap structure to store births and deaths records.

TreeMap inserts data sorted (key based) as (key, value) pair, here key is the year and value is the cumulative sum of births & deaths (negative for deaths).

We don't need to insert deaths value that happened after the highest birth year.

Once the TreeMap is populated with the births & deaths records, all the cumulative sums are updated and store the maximum population with year as it progressed.

Sample input & output: 1

Births: [1909, 1919, 1904, 1911, 1908, 1908, 1903, 1901, 1914, 1911, 1900, 1919, 1900, 1908, 1906]

Deaths: [1910, 1911, 1912, 1911, 1914, 1914, 1913, 1915, 1914, 1915]

Year counts Births: {1900=2, 1901=1, 1903=1, 1904=1, 1906=1, 1908=3, 1909=1, 1911=2, 1914=1, 1919=2}

Year counts Birth-Deaths combined: {1900=2, 1901=1, 1903=1, 1904=1, 1906=1, 1908=3, 1909=1, 1910=-1, 1911=0, 1912=-1, 1913=-1, 1914=-2, 1915=-2, 1919=2}

Yearwise population: {1900=2, 1901=3, 1903=4, 1904=5, 1906=6, 1908=9, 1909=10, 1910=9, 1911=9, 1912=8, 1913=7, 1914=5, 1915=3, 1919=5}

maxPopulation: 10
yearOfMaxPopulation: 1909

Sample input & output: 2

Births: [1906, 1901, 1911, 1902, 1905, 1911, 1902, 1905, 1910, 1912, 1900, 1900, 1904, 1913, 1904]

Deaths: [1917, 1908, 1918, 1915, 1907, 1907, 1917, 1917, 1912, 1913, 1905, 1914]

Year counts Births: {1900=2, 1901=1, 1902=2, 1904=2, 1905=2, 1906=1, 1910=1, 1911=2, 1912=1, 1913=1}

Year counts Birth-Deaths combined: {1900=2, 1901=1, 1902=2, 1904=2, 1905=1, 1906=1, 1907=-2, 1908=-1, 1910=1, 1911=2, 1912=0, 1913=0}

Yearwise population: {1900=2, 1901=3, 1902=5, 1904=7, 1905=8, 1906=9, 1907=7, 1908=6, 1910=7, 1911=9, 1912=9, 1913=9}

maxPopulation: 9
yearOfMaxPopulation: 1906

Here, deaths occurred (1914 & later) after the last birth year 1913, was not counted at all, that avoids unnecessary computations.

For a total of 10 million data (births & deaths combined) and over 1000 years range, the program took about 3 sec. to finish.

If same size data with 100 years range, it took 1.3 sec.

All the inputs are randomly taken.

like image 24
User_67128 Avatar answered Nov 15 '22 08:11

User_67128


I think we can have O(n log n) time with O(1) additional space by first sorting, then maintaining a current population and global maximum as we iterate. I tried to use the current year as a reference point but the logic still seemed a bit tricky so I'm not sure it's completely worked out. Hopefully, it can give an idea of the approach.

JavaScript code (counterexamples/bugs welcome)

function f(births, deaths){
  births.sort((a, b) => a - b);
  deaths.sort((a, b) => a - b);

  console.log(JSON.stringify(births));
  console.log(JSON.stringify(deaths));
  
  let i = 0;
  let j = 0;
  let year = births[i];
  let curr = 0;
  let max = curr;

  while (deaths[j] < births[0])
    j++;

  while (i < births.length || j < deaths.length){
    while (year == births[i]){
      curr = curr + 1;
      i = i + 1;
    }
    
    if (j == deaths.length || year < deaths[j]){
      max = Math.max(max, curr);
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    
    } else if (j < deaths.length && deaths[j] == year){
      while (deaths[j] == year){
        curr = curr - 1;
        j = j + 1;
      }
      max = Math.max(max, curr);
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    }

    if (j < deaths.length && deaths[j] > year && (i == births.length || deaths[j] < births[i])){
      year = deaths[j];
      while (deaths[j] == year){
        curr = curr - 1;
        j = j + 1;
      }
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    }

    year = births[i];
  }
  
  return max;
}

var input = [
  [[1997, 1997, 1997, 1998, 1999],
  [1998, 1999]],
  [[1, 2, 2, 3, 4],
  [1, 2, 2, 5]],
  [[1984, 1981, 1984, 1991, 1996],
  [1991, 1984, 1997]],
  [[1984, 1981, 1984, 1991, 1996],
  [1991, 1982, 1984, 1997]]
]

for (let [births, deaths] of input)
  console.log(f(births, deaths));

If the year range, m, is on the order of n, we could store the counts for each year in the range and have O(n) time complexity. If we wanted to get fancy, we could also have O(n * log log m) time complexity, by using a Y-fast trie that allows successor lookup in O(log log m) time.

like image 3
גלעד ברקן Avatar answered Nov 15 '22 06:11

גלעד ברקן


First aggregate the births and deaths into a map (year => population change), sort that by key, and calculate the running population over that.

This should be approximately O(2n + n log n), where n is the number of births.

$births = [1984, 1981, 1984, 1991, 1996];
$deaths = [1991, 1984];

function highestPopulationYear(array $births, array $deaths): ?int
{
    $indexed = [];

    foreach ($births as $birth) {
        $indexed[$birth] = ($indexed[$birth] ?? 0) + 1;
    }

    foreach ($deaths as $death) {
        $indexed[$death] = ($indexed[$death] ?? 0) - 1;
    }

    ksort($indexed);

    $maxYear = null;
    $max = $current = 0;

    foreach ($indexed as $year => $change) {
        $current += $change;
        if ($current >= $max) {
            $max = $current;
            $maxYear = $year;
        }
    }

    return $maxYear;
}

var_dump(highestPopulationYear($births, $deaths));
like image 3
Richard van Velzen Avatar answered Nov 15 '22 07:11

Richard van Velzen