I am trying to benchmark our client code, so I wrote a multithreaded program to do it. I want to measure how long (95th percentile) the method below takes:
attributes = deClient.getDEAttributes(columnsList);
Below is the multithreaded code I wrote to benchmark the method above. I am seeing a large difference between two scenarios:
1) With 20 threads, running for 15 minutes, I get a 95th percentile of 37 ms. Here I am using:
ExecutorService service = Executors.newFixedThreadPool(20);
2) But if I run the same program for 15 minutes using
ExecutorService service = Executors.newSingleThreadExecutor(); 
instead of
ExecutorService service = Executors.newFixedThreadPool(20);
I get a 95th percentile of 7 ms, which is far lower than the number I get when I run my code with newFixedThreadPool(20).
Can anyone tell me what could cause such a large performance difference between
newSingleThreadExecutor vs newFixedThreadPool(20)
In both cases I run my program for 15 minutes.
Below is my code:
public static void main(String[] args) {
    try {
        // create thread pool with given size
        //ExecutorService service = Executors.newFixedThreadPool(20);
        ExecutorService service = Executors.newSingleThreadExecutor();
        long startTime = System.currentTimeMillis();
        long endTime = startTime + (15 * 60 * 1000);//Running for 15 minutes
        for (int i = 0; i < threads; i++) {
            service.submit(new ServiceTask(endTime, serviceList));
        }
        // wait for termination        
        service.shutdown();
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
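    } catch (InterruptedException e) {
        // restore the interrupt flag instead of silently swallowing it
        Thread.currentThread().interrupt();
    } catch (Exception e) {
        // report failures; a swallowed exception would silently skew the benchmark
        e.printStackTrace();
    }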
}
Below is the class that implements the Runnable interface:
class ServiceTask implements Runnable {
    private static final Logger LOG = Logger.getLogger(ServiceTask.class.getName());
    private static Random random = new SecureRandom();
    public static volatile AtomicInteger countSize = new AtomicInteger();
    private final long endTime;
    private final LinkedHashMap<String, ServiceInfo> tableLists;
    public static ConcurrentHashMap<Long, Long> selectHistogram = new ConcurrentHashMap<Long, Long>();
    public ServiceTask(long endTime, LinkedHashMap<String, ServiceInfo> tableList) {
        this.endTime = endTime;
        this.tableLists = tableList;
    }
    @Override
    public void run() {
        try {
            while (System.currentTimeMillis() <= endTime) {
                double randomNumber = random.nextDouble() * 100.0;
                ServiceInfo service = selectRandomService(randomNumber);
                final String id = generateRandomId(random);
                final List<String> columnsList = getColumns(service.getColumns());
                List<DEAttribute<?>> attributes = null;
                DEKey bk = new DEKey(service.getKeys(), id);
                List<DEKey> list = new ArrayList<DEKey>();
                list.add(bk);
                Client deClient = new Client(list);
                final long start = System.nanoTime();
                attributes = deClient.getDEAttributes(columnsList);
                final long end = System.nanoTime() - start;
                final long key = end / 1000000L;
                boolean done = false;
                while(!done) {
                    Long oldValue = selectHistogram.putIfAbsent(key, 1L);
                    if(oldValue != null) {
                        done = selectHistogram.replace(key, oldValue, oldValue + 1);
                    } else {
                        done = true;
                    }
                }
                countSize.getAndAdd(attributes.size());
                handleDEAttribute(attributes);
                if (BEServiceLnP.sleepTime > 0L) {
                    Thread.sleep(BEServiceLnP.sleepTime);
                }
            }
        } catch (Exception e) {
            // report the failure; an exception here ends this task's loop early
            e.printStackTrace();
        }
    }
}
Update:
I am running my program on a Linux machine with 2 processors. Here is the processor spec:
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping        : 7
cpu MHz         : 2599.999
cache size      : 20480 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm arat pln pts
bogomips        : 5199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Can anyone tell me what can be the reason for such high performance issues with
newSingleThreadExecutor vs newFixedThreadPool(20)...
If you are running many more tasks in parallel (20 in this case) than you have processors (I doubt that you have a box with 20+ processors), then yes, each individual task is going to take longer to complete. It is easier for the computer to execute one task at a time than to switch between multiple threads running at the same time. Even if you limit the number of threads in the pool to the number of CPUs you have, each task will probably run slower, albeit slightly.
If, however, you compare the throughput (the number of tasks completed in a given amount of time) of your different sized thread pools, you should see that the throughput with 20 threads is higher. If you execute 1000 tasks with 20 threads, they will finish much sooner overall than with just 1 thread. Each task may take longer, but they will be executing in parallel. It will probably not be 20 times faster, given thread overhead and so on, but it might be something like 15 times faster.
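To make the latency-versus-throughput distinction concrete, here is a minimal sketch that runs the same batch of tasks on a 1-thread pool and on a 20-thread pool and reports both the total wall-clock time and the mean time per task. The class name `PoolComparison` and the `simulatedCall()` method are hypothetical; the latter is a stand-in that sleeps for 5 ms to mimic a blocking client call and is not the real `getDEAttributes`. Because the stand-in only blocks, it shows the throughput gain but not the CPU contention that can inflate per-task latency with a real workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolComparison {

    /** Hypothetical stand-in for the real client call: blocks for about 5 ms. */
    static void simulatedCall() throws InterruptedException {
        Thread.sleep(5);
    }

    /**
     * Runs taskCount calls on a fixed pool of poolSize threads.
     * Returns { total wall-clock ms, mean per-task ms }.
     */
    static double[] run(int poolSize, int taskCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    simulatedCall();
                    return System.nanoTime() - start; // per-task latency in ns
                });
            }
            long wallStart = System.nanoTime();
            long latencySum = 0;
            // invokeAll blocks until every task has completed
            for (Future<Long> f : pool.invokeAll(tasks)) {
                latencySum += f.get();
            }
            long wall = System.nanoTime() - wallStart;
            return new double[] { wall / 1e6, latencySum / 1e6 / taskCount };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (int poolSize : new int[] { 1, 20 }) {
            double[] r = run(poolSize, 200);
            System.out.printf("threads=%d total=%.0f ms, mean per task=%.2f ms%n",
                    poolSize, r[0], r[1]);
        }
    }
}
```

Swapping `simulatedCall()` for the real client call and comparing the two lines of output should show whether the 20-thread run trades a higher per-task time for a shorter total time on your machine.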
You should not be worrying about the individual task speed but rather you should be trying to maximize the task throughput by tuning the number of threads in your pool. How many threads to use depends heavily on the amount of IO, the CPU cycles used by each task, locks, synchronized blocks, other applications running on the OS, and other factors.
People often use 1-2 times the number of CPUs as a good place to start for the number of threads in the pool when maximizing throughput. If the tasks make more IO requests or perform other thread-blocking operations, add more threads. If they are more CPU bound, reduce the number of threads to be closer to the number of CPUs available. If your application is competing for OS cycles with other, more important applications on the server, then even fewer threads may be required.
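As a rough illustration of that starting point, the sketch below derives an initial pool size from the CPU count reported by the JVM. The class `PoolSizing`, the helper `suggestedThreads`, and the factor of 2.0 are assumptions made for this example only; the right value has to come from measuring your own workload.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {

    /**
     * Hypothetical helper: scales the CPU count by a caller-chosen factor
     * (for example about 1.0 for CPU-bound work, 2.0 or more for IO-heavy work).
     * Always returns at least 1.
     */
    static int suggestedThreads(int cpus, double factor) {
        return Math.max(1, (int) Math.round(cpus * factor));
    }

    public static void main(String[] args) {
        // Number of processors the JVM can see
        int cpus = Runtime.getRuntime().availableProcessors();
        // Assumed factor of 2.0 as an initial guess, to be tuned by measurement
        int threads = suggestedThreads(cpus, 2.0);
        ExecutorService service = Executors.newFixedThreadPool(threads);
        System.out.println("cpus=" + cpus + ", starting pool size=" + threads);
        service.shutdown();
    }
}
```

From there, rerun the benchmark with a few pool sizes around that value and keep the one that gives the best throughput at an acceptable 95th percentile.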