Suppose, that we need to sort 50 000 000 numbers. suppose, that the numbers is stored in a file. What is the most efficient algorithm for solving this problem? Parallel algorithm for sorting...
How to do it? Maybe useful link )
Ok.. I read about parallel mergesort... But it's not clear for me.
code is located here
For sorting a very large file , we can use external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into the main memory and instead they reside in the slower external memory . It uses a hybrid sort-merge strategy.
Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).
Integer Sorting with limited range is achieved efficiently with Bucket Sorting. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithm.
Selection sort takes one millisecond to sort 1000 items (worst-case time) on a particular computer. Estimate the amount of time it would take to sort 100,000 items.
50 million is not particularly large. I would just read them into memory. Sort them and write them out. It should take just a few seconds. How fast do you need it be? How compilcated do you need it to be?
On my old labtop it took 28 seconds. If I had more processors, it might be a little faster but much of the time is spent reading and writing the file (15 seconds) which wouldn't be any faster.
One of the critical factors is the size of your cache. The comparison itself is very cheap provided the data is in cache. As the L3 cache is shared, one thread is all you need to make full use of it.
public static void main(String...args) throws IOException {
    generateFile();
    long start = System.currentTimeMillis();
    int[] nums = readFile("numbers.bin");
    Arrays.sort(nums);
    writeFile("numbers2.bin", nums);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took "+time+" secs to sort "+nums.length+" numbers.");
}
private static void generateFile() throws IOException {
    Random rand = new Random();
    int[] ints = new int[50*1000*1000];
    for(int i= 0;i<ints.length;i++)
        ints[i] = rand.nextInt();
    writeFile("numbers.bin", ints);
}
private static int[] readFile(String filename) throws IOException {
    DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(filename), 64*1024));
    int len = dis.readInt();
    int[] ints = new int[len];
    for(int i=0;i<len;i++)
        ints[i] = dis.readInt();
    return ints;
}
private static void writeFile(String name, int[] numbers) throws IOException {
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(name), 64*1024));
    dos.writeInt(numbers.length);
    for (int number : numbers)
        dos.writeInt(number);
    dos.close();
}
From top of my head, merge sort seems to be the best option when it comes to parallelisation and distribution, as it uses divide-and-conquer approach. For more information, google for "parallel merge sort" and "distributed merge sort".
For single-machine, multiple cores example, see see Correctly multithreaded quicksort or mergesort algo in Java?. If you can use Java 7 fork/join then see: "Java 7: more concurrency" and "Parallelism with Fork/Join in Java 7".
For distributing it over many machines, see Hadoop, it has a distributed merge sort implementation: see MergeSort and MergeSorter. Also of interest: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With