When I have free time, I take part in contests on online judge websites just for fun. To get an AC result (all test
cases passed and the solution accepted) in these contests, your solution has to be efficient in both time and memory.
Most problems come with several test cases for evaluating solutions. Reading those test cases from a file and writing
your results to another file for checking can take a long time, because file input and output (IO) operations are
very slow compared to other operations in most programming languages.
I decided to benchmark several ways of reading and writing file data in Java, and here I present some of the
results I found. Note that this post shows results without any tuning of disk file IO, such as low-level caching.
Below are some approaches, with explanations, that you can use in Java for writing to a file and reading from it.
The benchmarks were run on an Intel® Core™ i3-3120M CPU @ 2.50GHz × 4 under 64-bit Ubuntu 14.04
with Java SE 1.8.0_20 on an SSD. Several factors (CPU, disk, and OS)
can make these benchmark results differ from machine to machine.
However, the relative differences should stay roughly the same.
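|| Approach || 1 MB || 10 MB || 100 MB || 1000 MB ||
|| FileInputStream || 600 ms || 4500 ms || 41200 ms || 425400 ms ||
|| BufferedInputStream || 30 ms || 350 ms || 3300 ms || 32100 ms ||
|| BufferedReader || 60 ms || 390 ms || 3400 ms || 32900 ms ||
|| Direct BufferedInputStream (2 MB buffer) || 50 ms || 80 ms || 280 ms || 3300 ms ||
|| MappedByteBuffer || 20 ms || 25 ms || 160 ms || 1450 ms ||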
What a difference! You may be wondering why the well-known and widely used input stream approach is so slow.
The main reason is that FileInputStream's read() method makes a native call for every single byte
(a raw octet, 8 bits) of the file content. For a 1 MB file, read() is called
1,048,576 times (1024 x 1024). Now you can imagine how many times it is called for 1000 MB. So, when working
with files, the first rule for speeding up your application is to watch the number of calls made
to the underlying system and disk.
Rule 1. Try to reduce the number of method calls to the underlying system and disk.
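To make Rule 1 concrete, here is a minimal sketch that simply counts how many read calls are needed for the same
file when reading one byte at a time versus reading in chunks. The class name ReadCallCount, its method names and
the 8 KB chunk size are my own illustrative choices, not part of the benchmark code.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only.
class ReadCallCount {

    // Reads the file with the no-argument read(), one byte per call,
    // and returns how many calls were made (including the final end-of-file call).
    static long countSingleByteReads(Path file) throws IOException {
        long calls = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            while (true) {
                calls++;
                if (in.read() == -1) {
                    break;
                }
            }
        }
        return calls;
    }

    // Reads the file with read(byte[]), up to chunkSize bytes per call,
    // and returns how many calls were made (including the final end-of-file call).
    static long countChunkedReads(Path file, int chunkSize) throws IOException {
        long calls = 0;
        byte[] chunk = new byte[chunkSize];
        try (InputStream in = new FileInputStream(file.toFile())) {
            while (true) {
                calls++;
                if (in.read(chunk) == -1) {
                    break;
                }
            }
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("read-calls", ".bin");
        try {
            Files.write(file, new byte[1024 * 1024]); // 1 MB of zero bytes
            System.out.println("single-byte calls: " + countSingleByteReads(file));
            System.out.println("8 KB chunk calls:  " + countChunkedReads(file, 8192));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The chunked count depends on how many bytes each read(byte[]) call actually returns, so treat it as an
order-of-magnitude comparison rather than an exact figure.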
To reduce those calls with FileInputStream, we can read a larger chunk of data (say 1 MB or 2 MB) from disk at once,
store it in memory, and then work with the stored data. That way we don't access the disk for each byte.
Since accessing memory is much faster than accessing disk, reading speeds up. BufferedInputStream
uses the same strategy to make file IO faster, storing data in an internal buffer (8192 bytes by default).
When read() is called on a buffered stream, the data comes from the buffer array in memory, and the underlying
system is accessed only when the buffer needs refilling.
We can create our own buffer array with a bigger size and speed up reading further. The Direct BufferedInputStream
approach shows the benchmark for a 2 MB byte array used as the buffer. As you can see, it is nearly 10 times faster
than Java's default BufferedInputStream when reading a 1000 MB file. You can even buffer the whole file into a byte
array, as long as you have enough memory.
Rule 2. Use your own buffer, sized according to your business logic.
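A minimal sketch of the "own buffer" idea might look like the following. It is not the exact benchmark code: the
class name OwnBufferRead, the sumBytes method and the byte-summing workload are assumptions I picked just to have
something to do with the data, and the 2 MB size mirrors the figure mentioned above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only.
class OwnBufferRead {

    // Illustrative buffer size, mirroring the 2 MB array mentioned in the text.
    static final int BUFFER_SIZE = 2 * 1024 * 1024;

    // Reads the file through a caller-sized byte array and sums all bytes
    // as unsigned values, so each system read fetches up to bufferSize bytes.
    static long sumBytes(Path file, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long sum = 0;
        try (InputStream in = new FileInputStream(file.toFile())) {
            int n;
            // read(byte[]) returns the number of bytes actually read, or -1 at end of file
            while ((n = in.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    sum += buffer[i] & 0xFF;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("own-buffer", ".bin");
        try {
            Files.write(file, new byte[] {1, 2, 3, 4});
            System.out.println("sum = " + sumBytes(file, BUFFER_SIZE));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that the loop only processes the first n bytes of the buffer after each read, since the last chunk is usually
shorter than the buffer.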
Note that so far we have only discussed reading bytes from a file. However, in many cases
we want to read the content of a file as characters. Unlike the buffered streams, BufferedReader is all about
reading characters. Behind the scenes, it reads bytes from disk (like the streams do) and decodes them into characters.
You can see that BufferedReader is a little slower than BufferedInputStream because of this translation step.
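The byte-to-character translation can be seen in a small sketch like this one, which counts the characters a
BufferedReader produces from a file. The class name DecodeChars and the countChars method are hypothetical, and the
explicit UTF-8 charset is my assumption; the benchmark may well have used the platform default.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration only.
class DecodeChars {

    // Counts the characters decoded from the file using the given charset.
    // InputStreamReader does the byte-to-char decoding; BufferedReader buffers the result.
    static long countChars(Path file, Charset charset) throws IOException {
        long chars = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                chars += n;
            }
        }
        return chars;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("decode", ".txt");
        try {
            // "caf\u00e9" is four characters, but the accented letter takes two bytes in UTF-8
            Files.write(file, "caf\u00e9".getBytes(StandardCharsets.UTF_8));
            System.out.println("bytes on disk: " + Files.size(file));
            System.out.println("chars decoded: " + countChars(file, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The number of bytes on disk and the number of decoded characters differ whenever the text contains multi-byte
characters, which is exactly the extra work the reader has to do.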
Next comes MappedByteBuffer, Java's class for memory-mapped IO. So, what is a memory-mapped file in Java?
Memory-mapped files are a feature of the Java NIO package (java.nio) that lets you access the contents of a file
directly from memory. This is achieved by mapping the whole file, or a portion of it, into memory. The operating
system takes care of loading the requested pages and writing changes back to the file, while the application only
deals with memory, which results in very fast IO operations. The memory used for a memory-mapped file is outside
the Java heap space. We use MappedByteBuffer in Java to read from and write to this memory.
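As a rough sketch of reading through a mapping, the following maps a whole file read-only and sums its bytes. The
class name MappedRead and the sumBytes method are hypothetical names of mine, and the sketch assumes the file fits
in a single mapping (a single mapped region cannot be larger than Integer.MAX_VALUE bytes).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class for illustration only.
class MappedRead {

    // Maps the whole file read-only and sums its bytes as unsigned values.
    // Assumes the file is small enough for one mapping; larger files
    // would need to be mapped region by region.
    static long sumBytes(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (map.hasRemaining()) {
                sum += map.get() & 0xFF; // reads come from mapped memory, not from read() calls
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-read", ".bin");
        // Deleting a file that is still mapped can fail on some platforms,
        // so the sketch defers cleanup to JVM exit.
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {1, 2, 3, 4});
        System.out.println("sum = " + sumBytes(file));
    }
}
```

Closing the channel does not invalidate the mapping; the mapped region stays valid until the buffer itself is
garbage-collected.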
Memory-mapped files are fast enough that they are used where performance is very important. However, there are some
trade-offs, such as page faults when a requested page is not in memory.
If your application deals with small files, choosing between MappedByteBuffer and the buffered streams does
not make much difference. However, when reading data from large files and changing certain pages of the file
using multiple cores, memory mapping has a huge impact on performance. Since a mapped file loads the whole file,
or some region of it, into shared memory, each process does not have to load its own copy of the file into memory.
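Changing just one part of a file through a mapping could be sketched as below. This is a single-threaded
illustration only, with hypothetical names (MappedRegionWrite, overwriteRegion); it maps only the region being
changed rather than the whole file and does not attempt to show the multi-core case.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class for illustration only.
class MappedRegionWrite {

    // Overwrites data.length bytes starting at offset by mapping only that region.
    // Mapping in READ_WRITE mode requires the channel to be open for reading and writing.
    static void overwriteRegion(Path file, long offset, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer region =
                    channel.map(FileChannel.MapMode.READ_WRITE, offset, data.length);
            region.put(data);
            region.force(); // ask the OS to write the changed pages back to the storage device
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-write", ".txt");
        // Deleting a file that is still mapped can fail on some platforms,
        // so the sketch defers cleanup to JVM exit.
        file.toFile().deleteOnExit();
        Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
        overwriteRegion(file, 3, "abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.US_ASCII));
    }
}
```

Only the pages covering the mapped region are touched, which is why this style suits large files where a few
regions change at a time.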