(Coursenotes for CSC 305 Individual Software Design and Development)

Serialization, File I/O

This lecture is mostly prep for project 3.

File I/O in Java

So far, we have handled applications that only operate on data stored in main memory (RAM). Data in RAM is much faster to access than data stored on disk or on other storage devices. Reads and writes can be orders of magnitude slower in disk-based applications than in memory-bound applications.

In previous projects and courses, you’ve had some experience reading from and writing to files. Most of the time we use language libraries for this interaction with files without thinking too much about it. For example, Java provides the Scanner API for reading data from files.

You all used the Scanner API to implement your TUIs in a previous project and lab. There were no files involved there, because the “source” of input was System.in, or the “standard input stream”.

Scanner scan = new Scanner(System.in);

If you wanted to read from a file instead, you would initialise the Scanner like so:

// The Scanner(File) constructor throws a checked FileNotFoundException, so you must handle it.
// The try-with-resources form also closes the Scanner when the block ends.
try (Scanner scan = new Scanner(new File("my text file.txt"))) {
  // do stuff with the file's contents
  // the same Scanner API applies: scan.nextLine(), scan.next(), etc.
} catch (FileNotFoundException e) {
  // handle the scenario where the file doesn't exist
}

Similarly, Java provides a FileWriter for writing to files:

try (FileWriter writer = new FileWriter(new File("some text file.txt"))) {
  // write data to the file
  // API calls like writer.write("some text..."); 
} catch (IOException e) {
  // handle the exception 
}
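The two snippets above can be combined into one runnable round trip. This is only a sketch: the class name TextFileRoundTrip, its helper methods, and the temporary file are made up for illustration and are not part of any project starter code.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
class TextFileRoundTrip {
  // Writes each string on its own line, replacing any existing file contents.
  static void writeLines(File file, List<String> lines) throws IOException {
    try (FileWriter writer = new FileWriter(file)) {
      for (String line : lines) {
        writer.write(line + System.lineSeparator());
      }
    } // the writer is flushed and closed here, even if write() throws
  }

  // Reads the file back sequentially, one line at a time.
  static List<String> readLines(File file) throws IOException {
    List<String> lines = new ArrayList<>();
    try (Scanner scan = new Scanner(file)) {
      while (scan.hasNextLine()) {
        lines.add(scan.nextLine());
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("notes", ".txt"); // scratch file for the demo
    file.deleteOnExit();
    writeLines(file, List.of("first line", "second line"));
    System.out.println(readLines(file)); // prints [first line, second line]
  }
}
```

Note that FileNotFoundException is a subclass of IOException, so declaring throws IOException covers both the Scanner and the FileWriter.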

Random access files

Reading files with a Scanner is convenient, but comes with a couple of limitations.

The Scanner is limited to reading text. When you say something like scanner.nextLine() to get a line from the input source, or scanner.next() to get a “token” delimited by whitespace, you are assuming that the data you’re reading from the input source is textual data. You can’t use a Scanner to read things like raw binary data.

The Scanner cannot jump to arbitrary positions in the file. You must read the file from start to finish, using methods like next(), which gets you the next word, or nextLine(), which gets you the next line. There is no moving backward, and there is no moving forward without reading the file along the way.

These are good limitations! Most of the time we do want to only read text from a file (words, ints, lines, spaces, etc.), so an interface that’s on the same page as us is a good thing.

There are also performance reasons why it’s a good assumption that you’ll want to read data sequentially.

RandomAccessFiles

However, sometimes we need something a little less controlled. That is, there are situations in which we need to read raw bytes from a file that cannot be interpreted as text (for example, if we are storing compressed data that needs to be decompressed according to some protocol). And sometimes we need to be able to quickly access data from arbitrary locations in the file, without reading up to that location (for example, consider how large database systems like MySQL or PostgreSQL store and access their data on disk).

The Java standard library provides the RandomAccessFile API to allow this.

Here are some key concepts in and methods from this API:

Notice that all of the operations above are in terms of bytes, instead of nice logical chunks of data that could be interpreted to mean, say, unicode characters. Like we said above, sometimes this is useful depending on what you want to do.

But it’s also a matter of necessity: if you’re reading arbitrary amounts of data from a file starting at some arbitrary position, there’s no guarantee that you’re getting a logically meaningful chunk of data. So the RandomAccessFile API works in terms of bytes, and it’s up to you to “decode” the bytes into something meaningful.
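To make the byte-oriented, position-based style concrete, here is a small sketch using a few RandomAccessFile methods: write, seek, getFilePointer, setLength, and readFully (a variant of read that insists on filling the whole array). The class name FilePointerDemo, the helper method, and the ten sample bytes are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class FilePointerDemo {
  // Overwrites the file with the bytes 0..9, then reads `count` bytes starting at `offset`.
  static byte[] writeThenReadAt(File file, long offset, int count) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.setLength(0); // discard any old contents
      raf.write(new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
      System.out.println(raf.getFilePointer()); // prints 10: writing advanced the pointer

      raf.seek(offset); // jump straight to the requested byte, without reading up to it
      byte[] chunk = new byte[count];
      raf.readFully(chunk); // fills the whole array or throws EOFException
      System.out.println(raf.getFilePointer()); // offset + count: reading advanced it too
      return chunk;
    }
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("pointer-demo", ".dat"); // scratch file for the demo
    file.deleteOnExit();
    byte[] chunk = writeThenReadAt(file, 4, 3);
    System.out.println(Arrays.toString(chunk)); // prints [4, 5, 6]
  }
}
```

The bytes that come back are just numbers. Whether they mean characters, integers, or part of something larger is up to the code that reads them.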

There are, however, some “convenience methods”:

The RandomAccessFile also provides methods like readDouble, readChar, etc. These are kind of doing what the Scanner’s nextInt does: they read the requisite number of bytes (e.g., 4 bytes for int, 8 bytes for double) and treat those bytes as the appropriate data type. It’s up to you to ensure that, for example, the next 4 bytes actually contain a meaningful integer.

(In a Scanner, calling nextInt when the next token isn’t an integer throws an InputMismatchException, which crashes the program unless you catch it. A RandomAccessFile will simply read the next four bytes and interpret them as an int, whether or not that’s “correct” or what you wanted; it only complains, with an EOFException, if fewer than four bytes remain. So tread carefully.)
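The sketch below illustrates both sides of this: the convenience methods round-tripping values correctly, and readInt happily misinterpreting bytes that were never meant to be an int. The class name ConvenienceMethodsDemo and the sample values are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class ConvenienceMethodsDemo {
  // Writes an int and a double, then reads them back in the same order.
  static String writeThenRead(File file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.setLength(0);     // discard any old contents
      raf.writeInt(305);    // writes 4 bytes
      raf.writeDouble(2.5); // writes 8 bytes
      raf.seek(0);          // go back to the start before reading
      int i = raf.readInt();
      double d = raf.readDouble();
      return i + " " + d;
    }
  }

  // Writes the four ASCII bytes of "Java", then (mis)reads them as an int.
  static int misreadTextAsInt(File file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.setLength(0);
      raf.write("Java".getBytes(StandardCharsets.US_ASCII)); // bytes 4A 61 76 61
      raf.seek(0);
      return raf.readInt(); // no error: the four bytes are treated as one int
    }
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("convenience", ".dat"); // scratch file for the demo
    file.deleteOnExit();
    System.out.println(writeThenRead(file)); // prints 305 2.5
    System.out.println(Integer.toHexString(misreadTextAsInt(file))); // prints 4a617661
  }
}
```

The second method never fails. It just returns a number that has nothing to do with what was written, which is exactly the kind of bug that is hard to spot.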

Serialization

This leads us into our next topic.

Serialization is the conversion of an object (or some piece of data) to a byte stream. Deserialization is the process of turning a byte stream back into the object (or the original piece of data).

They are sometimes called marshalling and unmarshalling.

There are many reasons why we might want to serialize data:

Of course, there exist multiple structured data formats for exporting data to files or sending data over the network, like JSON, XML, and YAML. Those are certainly much more friendly and human-readable than writing out raw byte streams.

However, these exports tend to be much bigger, since those formats represent the data as plain text and include extra structural characters (like colons :, braces { }, brackets [ ], and whitespace). So data written out in formats like JSON is human-readable while still being structured enough to be parsed by programs, but it takes up more space on disk or over the network than the equivalent raw bytes.
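As a rough illustration of the size difference, the sketch below encodes the same three numbers twice: once as a hand-written JSON-style array of text, and once as raw two-byte values using java.nio.ByteBuffer. The class name SizeComparison and the sample values are assumptions made for this illustration, and real JSON libraries may format their output differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class SizeComparison {
  // Builds a compact JSON-style array such as [31543,31543,31543] and returns its bytes.
  static byte[] asJsonText(short[] values) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(values[i]);
    }
    return sb.append(']').toString().getBytes(StandardCharsets.UTF_8);
  }

  // Packs each short into exactly two bytes, with no separators at all.
  static byte[] asRawBytes(short[] values) {
    ByteBuffer bb = ByteBuffer.allocate(Short.BYTES * values.length);
    for (short v : values) {
      bb.putShort(v);
    }
    return bb.array();
  }

  public static void main(String[] args) {
    short[] values = {31543, 31543, 31543};
    System.out.println(asJsonText(values).length); // prints 19
    System.out.println(asRawBytes(values).length); // prints 6
  }
}
```

The text version spends one byte per digit plus the brackets and commas, while the raw version always spends two bytes per short, however many digits the number has.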

Hence, we sometimes opt to serialize our data into raw byte streams.

An example

Suppose we are trying to serialize a short (an integer data type in Java that takes up two bytes of memory).

First recognise that the short is an abstraction. The computer doesn’t know what an integer or a short is; all it knows is how to read bits and bytes. We (humans) decide that in certain contexts, certain sequences of bits and bytes mean certain human-sensible things (like integers, booleans, or characters).

So to serialise this short, our first task is to become “less abstract”—we’re going from the abstract human-friendly representation (the number 31543) to a less abstract representation (a byte array).

We use the ByteBuffer class to help with this.

// This is our number we want to serialise.
short shortNum = 31543;

// Create an empty ByteBuffer, with room for two bytes.
ByteBuffer bb = ByteBuffer.allocate(2);

// Put the short into the ByteBuffer and we can obtain the array of raw bytes.
bb.putShort(shortNum);
byte[] asArray = bb.array();
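It can help to look at what those two bytes actually are. The sketch below prints them in decimal and in hexadecimal, and rebuilds the number by hand. A ByteBuffer uses big-endian byte order by default, so the most significant byte comes first. The class name ShortToBytes and its helper are assumptions made for this illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical class name, used only for this illustration.
class ShortToBytes {
  // Converts a short into its two-byte, big-endian representation.
  static byte[] toBytes(short value) {
    return ByteBuffer.allocate(Short.BYTES).putShort(value).array();
  }

  public static void main(String[] args) {
    byte[] asArray = toBytes((short) 31543);

    System.out.println(Arrays.toString(asArray)); // prints [123, 55]
    System.out.printf("%02X %02X%n", asArray[0], asArray[1]); // prints 7B 37

    // Rebuild the number by hand: high byte shifted left 8 bits, plus the low byte.
    int rebuilt = ((asArray[0] & 0xFF) << 8) | (asArray[1] & 0xFF);
    System.out.println(rebuilt); // prints 31543
  }
}
```

In other words, 31543 is 123 * 256 + 55, and the byte array stores exactly those two numbers.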

We can now use the RandomAccessFile API to write out this byte array to a file.

// Opening in "rw" mode creates myFile.dat if it does not already exist.
// The constructor and write() can throw IOException, so handle or declare it.
RandomAccessFile raf = new RandomAccessFile(new File("myFile.dat"), "rw");
raf.write(asArray);

The entire asArray byte array has been written to the random access file.

Note that this moves the file pointer forward by two bytes! So any future reads will happen from that point onward. If you wanted to read the two bytes back into memory, you would need to move the file pointer back first.

In the code below, we read back the two bytes we just wrote.

// Prepare the byte array into which you'll read data.
byte[] fromFile = new byte[2];

// Move the pointer back to where you want to start reading.
// In this case, the beginning of the file.
// If you forget to do this, the read starts wherever the pointer currently is
// (here, just past the bytes you wrote), so you get the wrong data without any error.
raf.seek(0);

// Read in fromFile.length bytes and place them in the array.
raf.read(fromFile); 

// Get the short back. We are in human-readable land again!
short num = ByteBuffer.wrap(fromFile).getShort();

System.out.println(num); // Prints 31543.
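The same idea extends from one short to many. Because every short takes exactly two bytes, the value at a given index lives at byte offset index * 2, so it can be read by seeking straight there. The following is a minimal sketch under that assumption. The class name ShortFile, its methods, and the sample values are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Hypothetical class name, used only for this illustration.
class ShortFile {
  // Serializes all the shorts into one byte array and writes it to the file.
  static void writeAll(File file, short[] values) throws IOException {
    ByteBuffer bb = ByteBuffer.allocate(Short.BYTES * values.length);
    for (short v : values) {
      bb.putShort(v);
    }
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.setLength(0); // discard any old contents
      raf.write(bb.array());
    }
  }

  // Reads the short at the given index without reading anything before it.
  static short readAt(File file, int index) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
      raf.seek((long) index * Short.BYTES); // each short occupies two bytes
      byte[] fromFile = new byte[Short.BYTES];
      raf.readFully(fromFile); // throws EOFException if the index is past the end
      return ByteBuffer.wrap(fromFile).getShort();
    }
  }

  public static void main(String[] args) throws IOException {
    File file = File.createTempFile("shorts", ".dat"); // scratch file for the demo
    file.deleteOnExit();
    writeAll(file, new short[] {10, 20, 31543, 40});
    System.out.println(readAt(file, 2)); // prints 31543
  }
}
```

Opening and closing the file on every call keeps the sketch short. A program that does many reads would more likely keep one RandomAccessFile open and reuse it.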