(Coursenotes for CSC 305 Individual Software Design and Development)

Streams

  • Refresher on the builder pattern
  • Streams and lambdas

Streams and lambdas

  • Refresher about lambdas
    • Function, Predicate, Consumer
  • The Stream API was one of the major features added in Java 8.
  • Streams are wrappers around a data source. They let us operate on that data source and make bulk processing convenient and fast.
  • A stream does not store data and, in that sense, is not a data structure. It also never modifies the underlying data source.
  • This functionality supports functional-style operations on streams of elements, such as map-reduce transformations on collections. The operations can be composed into “stream pipelines”
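As a refresher on the three functional interfaces named above, here is a minimal sketch. The class name and the example lambdas are made up for illustration; only Function, Predicate and Consumer come from java.util.function.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class LambdaRefresher {
    // Function<T, R>: takes a T and returns an R.
    static final Function<String, Integer> LENGTH = s -> s.length();

    // Predicate<T>: takes a T and returns a boolean.
    static final Predicate<String> IS_SHORT = s -> s.length() < 4;

    public static void main(String[] args) {
        // Consumer<T>: takes a T and returns nothing; it is used for its effect.
        Consumer<String> printer = s -> System.out.println("Got: " + s);

        System.out.println(LENGTH.apply("stream")); // 6
        System.out.println(IS_SHORT.test("map"));   // true
        printer.accept("lambda");                   // Got: lambda
    }
}
```

These are the shapes that stream operations expect: map takes a Function, filter takes a Predicate, and forEach takes a Consumer.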

Creating streams

You can create empty streams, or streams from existing data sources like lists.

Stream<String> s1 = Stream.empty(); // empty stream

// Stream of strings
Stream<String> s2 = Stream.of("these", "are", "stream", "contents");

// Stream of strings
List<String> myList = List.of("This", "is", "a", "list", "of", "Strings");
Stream<String> s3 = myList.stream();

// Using the builder pattern
Stream<String> s4 = Stream.<String>builder()
    .add("builder")
    .add("pattern")
    .add("in action")
    .build();

Note that if the Stream comes from an existing data source, it does NOT modify that data source, no matter what operations are performed in the Stream.
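A small sketch of that guarantee, assuming Java 16 or later for Stream.toList(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SourceUnchanged {
    // Builds a new list of upper-cased words; the source list is only read.
    static List<String> upperCased(List<String> source) {
        return source.stream()
            .map(String::toUpperCase)
            .toList();
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("these", "are", "words"));
        List<String> shouted = upperCased(words);

        System.out.println(words);   // [these, are, words]
        System.out.println(shouted); // [THESE, ARE, WORDS]
    }
}
```

The map step produces new values that flow down the pipeline; the original list still holds the lower-case strings afterwards.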

Beyond wrapping pre-existing data, you can also generate streams from other sources, such as a random number generator or a numeric range:

Random random = new Random();
DoubleStream ds = random.doubles(3); // Stream of 3 random doubles in [0, 1)
IntStream intStream = IntStream.range(1, 3); // 1, 2 (end is exclusive)
LongStream longStream = LongStream.rangeClosed(1, 3); // 1, 2, 3 (end is inclusive)

Finally, you can also create streams out of file contents:

Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .forEach(System.out::println);

This is in contrast to Files.readAllLines(Path.of("file.txt")), which reads every line into a List<String> up front. For a large file, that can be time- and memory-intensive. The Stream version loads lines “lazily” and processes them one at a time.
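One practical detail when using Files.lines: the returned stream holds the file open, and its documentation recommends closing it with try-with-resources. A minimal sketch follows; the class name, method name, and temporary file are made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class LazyLines {
    // Counts the non-blank lines without loading the whole file into memory.
    static long countNonBlank(Path file) throws IOException {
        // try-with-resources closes the underlying file when the block exits.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.isBlank()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // A throwaway file so the example can run anywhere.
        Path tmp = Files.createTempFile("lines-demo", ".txt");
        Files.write(tmp, List.of("first", "", "third"));

        System.out.println(countNonBlank(tmp)); // 2

        Files.delete(tmp);
    }
}
```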

Stream pipelines

In general, a stream pipeline consists of:

  • A source, which can be an array, a collection, a generator function, an I/O channel, etc. We talked about this above.
  • Zero or more intermediate operations, which transform the stream into another stream. Because these intermediate operations return streams themselves, they can be chained together to perform a number of operations.
  • Exactly one terminal operation, which produces a result or a side effect. Since the terminal operation “exits” the pipeline, no further stream operations can be added to the pipeline.

Many stream operations take in a behavioural parameter (i.e., a function). This can be written inline as a lambda, referred to using a variable that points to a lambda, or using the method reference syntax (e.g., System.out::println). These behavioural parameters represent functions that must be applied to each item in the stream.
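The three ways of supplying a behavioural parameter can be sketched side by side. The class and method names are hypothetical, and Stream.toList() assumes Java 16 or later.

```java
import java.util.List;
import java.util.function.Function;

class BehaviouralParameters {
    // 1. Inline lambda.
    static List<String> withInlineLambda(List<String> words) {
        return words.stream().map(s -> s.toUpperCase()).toList();
    }

    // 2. A variable that points to a lambda.
    static List<String> withVariable(List<String> words) {
        Function<String, String> upper = s -> s.toUpperCase();
        return words.stream().map(upper).toList();
    }

    // 3. Method reference.
    static List<String> withMethodReference(List<String> words) {
        return words.stream().map(String::toUpperCase).toList();
    }

    public static void main(String[] args) {
        List<String> words = List.of("map", "me");
        System.out.println(withInlineLambda(words));    // [MAP, ME]
        System.out.println(withVariable(words));        // [MAP, ME]
        System.out.println(withMethodReference(words)); // [MAP, ME]
    }
}
```

All three produce the same result; the choice is a matter of readability.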

Earlier in the quarter we talked about various “small patterns” we often perform with for loops in imperative languages, like transforming all items in a collection by applying some function to each item (map), or removing certain items in a collection based on some condition (filter), or summing up or aggregating in some way the values in a collection (reduce).

Streams allow us to define “pipelines” of these operations to be performed on collections.
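To make the connection concrete, here is a sketch of the same computation written both ways: sum the squares of the even numbers in a list. The task and the names are invented for illustration.

```java
import java.util.List;

class LoopVersusPipeline {
    // Imperative version: filter, map, and reduce are tangled together in one loop.
    static int withLoop(List<Integer> numbers) {
        int total = 0;
        for (int n : numbers) {
            if (n % 2 == 0) {     // filter
                total += n * n;   // map, then reduce
            }
        }
        return total;
    }

    // Stream version: each small pattern is its own named step.
    static int withStream(List<Integer> numbers) {
        return numbers.stream()
            .filter(n -> n % 2 == 0)
            .map(n -> n * n)
            .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(withLoop(numbers));   // 56
        System.out.println(withStream(numbers)); // 56
    }
}
```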

For example, imagine that we have a giant file of strings, and we need to:

  • Upper-case each line
  • Keep only the lines that include the phrase “SECRET PHRASE”

If our data is in a file called “file.txt”, this would look like

Stream<String> result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(String::toUpperCase)
    .filter(l -> l.contains("SECRET PHRASE"));

You’ll notice that we still only have a Stream<String> after the above code runs. That’s because all we’ve done is create a pipeline of operations to be run — we haven’t actually executed those operations yet. The map and filter above are intermediate operations. They are not actually kicked off until a terminal operation is added to the stream pipeline.

As I mentioned earlier, a terminal operation produces some result or side effect, thereby exiting the stream pipeline. Some examples of terminal operations are:

  • Collecting the result of the Stream into a list (.toList())
  • Counting the elements left in the stream after the intermediate operations have been performed (count())
  • In the case of primitive streams like IntStream, DoubleStream, LongStream, you can perform numerical aggregations (.sum(), .average(), etc.)

Here are some examples:

Notice that the type of result is now List<String>. It is no longer a stream to which we can add further computations.

List<String> result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(String::toUpperCase)
    .filter(l -> l.contains("SECRET PHRASE"))
    .toList();

You can specify that a map should mapToInt (i.e., map to an IntStream). That makes numerical operations such as max() available on the stream:

OptionalInt result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(String::toUpperCase)
    .filter(l -> l.contains("SECRET PHRASE"))
    .mapToInt(String::length)
    .max();

PONDER Why do you think we get an OptionalInt instead of a plain old int in return?

Finally, you can terminate streams with “side effects”, i.e., functions that don’t return a value, but have some other effect (e.g., they change the value of some other variable, or they write to some output stream).

Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(String::toUpperCase)
    .filter(l -> l.contains("SECRET PHRASE"))
    .forEach(System.out::println);

In the code above, we are applying the forEach terminal operation to the stream. In the terminal operation, we are passing each item to the System.out::println function that you know and love. Recall that :: is the method reference syntax: we are “pointing to” the println function and saying “call this on each item in the stream”. If lambdas are more your thing, you can write that as l -> System.out.println(l). But in general it’s better to use method references for lambdas that are this simple.

Stream pipelines are evaluated lazily

Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.

This has important implications. For example, consider the following pipeline, where I’ve added print statements in each operation.

OptionalInt result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(line -> {
        System.out.println("Upper-casing " + line);
        return line.toUpperCase();
    })
    .filter(line -> {
        System.out.println("\tChecking " + line + " for secret phrase");
        return line.contains("SECRET PHRASE");
    })
    .mapToInt(line -> {
        System.out.println("\t\tMapping " + line + " to charlength");
        return line.length();
    })
    .max();

Can you predict what the printed output would be with the following input?

INPUT

here
are
sOME
LINes SEcret phrASE
in
A
File

OUTPUT

Upper-casing here
        Checking HERE for secret phrase
Upper-casing are
        Checking ARE for secret phrase
Upper-casing sOME
        Checking SOME for secret phrase
Upper-casing LINes SEcret phrASE
        Checking LINES SECRET PHRASE for secret phrase
                Mapping LINES SECRET PHRASE to charlength
Upper-casing in
        Checking IN for secret phrase
Upper-casing A
        Checking A for secret phrase
Upper-casing File
        Checking FILE for secret phrase

The mapToInt step only applied to one item—the one that survived the previous filtering step.
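Laziness can also be observed without a file. The sketch below counts how often the map step runs, first before any terminal operation is attached and then after a short-circuiting findFirst(). The counter is a side effect used purely for demonstration, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazinessDemo {
    // Returns {map calls before the terminal operation, map calls after it}.
    static int[] mapCallsBeforeAndAfter(List<String> words) {
        AtomicInteger mapCalls = new AtomicInteger(); // demo-only counter

        // Building the pipeline does not run any of its steps.
        Stream<String> pipeline = words.stream()
            .map(w -> {
                mapCalls.incrementAndGet();
                return w.toUpperCase();
            })
            .filter(w -> w.startsWith("S"));
        int before = mapCalls.get();

        // The terminal operation starts the work, and stops at the first match.
        pipeline.findFirst();
        int after = mapCalls.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] calls = mapCallsBeforeAndAfter(
            List.of("here", "are", "some", "lines", "in", "a", "file"));
        System.out.println(calls[0]); // 0
        System.out.println(calls[1]); // 3: here, are, some
    }
}
```

On a sequential stream, the remaining four words are never upper-cased, because findFirst() already has its answer.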

Rules for behavioural parameters

All behavioural parameters to streams must:

  • Be Non-interfering: While a stream pipeline is executing (i.e., its terminal operation has been invoked), its data source must not be modified. This is similar to how you will get a ConcurrentModificationException if you modify a collection while using a for-each loop to iterate over it.
  • Be Stateless: A stateful lambda or function is one whose result depends on any state (e.g., instance variables in a class) that might change during execution of the stream pipeline.
  • Not have side-effects: Recall that stream operations are lazily applied, so a side effect inside a behavioural parameter may run later than you expect, or not at all.

From the Stream documentation

A stream implementation is permitted significant latitude in optimizing the computation of the result. For example, a stream implementation is free to elide operations (or entire stages) from a stream pipeline – and therefore elide invocation of behavioral parameters – if it can prove that it would not affect the result of the computation. This means that side-effects of behavioral parameters may not always be executed and should not be relied upon, unless otherwise specified (such as by the terminal operations forEach and forEachOrdered).

In short, you cannot rely on all stream operations always being executed.
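A common case where these rules matter is collecting results. The sketch below contrasts a lambda that mutates a list outside the pipeline with a version that lets the terminal operation build the result. Names are hypothetical, and Stream.toList() assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

class SideEffectFree {
    // Discouraged: forEach mutates a list that lives outside the pipeline.
    // This happens to work sequentially, but ArrayList is not thread-safe,
    // so the same code would be unsafe on a parallel stream.
    static List<String> withSideEffect(List<String> words) {
        List<String> longWords = new ArrayList<>();
        words.stream()
            .filter(w -> w.length() > 2)
            .forEach(longWords::add);
        return longWords;
    }

    // Preferred: no shared state; the terminal operation produces the list.
    static List<String> withoutSideEffect(List<String> words) {
        return words.stream()
            .filter(w -> w.length() > 2)
            .toList();
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "list", "of", "words");
        System.out.println(withSideEffect(words));    // [list, words]
        System.out.println(withoutSideEffect(words)); // [list, words]
    }
}
```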

Parallel streams

As mentioned above, a Stream doesn’t kick off until a terminal operation is called. So until that happens, the Stream is still being built up (using the Builder pattern), and all the intermediate operations like map and filter are being added to it.

At this point the “Stream” is usually a “sequential stream”. That is, it processes the data in one thread. However, modern computers usually have multiple cores available, meaning they can perform several actions at once. This means that some computations can be sped up if we can split up the problem (or the input data) into subsets, process those subsets, and combine the results.

You can do this by turning the Stream into a Parallel Stream.

Any stream can be told to operate in parallel by calling parallel() on it. parallel() is an intermediate operation. (Or, if your stream’s data source is a data structure like a list, you can call parallelStream() on it instead of stream() to begin streaming).
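Both ways of asking for a parallel stream can be sketched as follows. The class and method names are made up; the computations are deliberately simple, so this shows the API rather than a realistic speedup.

```java
import java.util.List;
import java.util.stream.LongStream;

class ParallelDemo {
    // Sequential: one thread walks the whole range.
    static long sumSequential(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    // parallel() switches an existing stream to parallel execution.
    static long sumParallel(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    // parallelStream() starts a parallel stream directly from a collection.
    static int totalLength(List<String> words) {
        return words.parallelStream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumSequential(1_000_000)); // 500000500000
        System.out.println(sumParallel(1_000_000));   // 500000500000
        System.out.println(totalLength(List.of("split", "me", "up"))); // 9
    }
}
```

Because addition is associative and these pipelines have no side effects, the parallel and sequential versions give the same answer.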

While this can result in a significant speedup, there are some important things to be aware of:

  • By default, the parallel stream will use one fewer than the number of available cores. This may be fine for demonstrating parallelism, but in the real world, you often want more control over how much of the computer’s resources will be devoted to a task. For example, your code may be invoked programmatically by some other module that is itself working on subproblems in parallel and needs the ability to orchestrate thread management. If your parallel stream kicks off a number of long-running tasks, you will soon effectively block all available threads.
  • Parallel processing may actually be slower than sequential processing in some cases. If you’re not working on long-enough running tasks, or not working with enough data, then the overhead of splitting the task and combining results may outweigh any benefits of parallel processing.
  • Finally, there can be some “gotchas” while working with parallel streams. Consider the following reduce operation, which uses 5 as the initial value instead of the default 0.

Example from Baeldung

List.of(1, 2, 3, 4)
  .parallelStream()
  .reduce(5, Integer::sum);

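In normal sequential application, we would get the result 5 + 1 + 2 + 3 + 4 = 15. However, in a parallel stream, the input is split into chunks, each chunk is reduced starting from 5, and the partial results are then combined, so 5 is added once per chunk. Depending on how the work is split (which in turn depends on how many threads are dedicated to this task), we will get different results. The underlying problem is that the first argument to reduce is meant to be an identity value for the operation, and 5 is not an identity for addition.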
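One way to avoid this gotcha, sketched below, is to reduce with a true identity (0 for addition) and add the 5 once, outside the stream. The class and method names are hypothetical.

```java
import java.util.List;

class ReduceIdentity {
    // Sequential: the seed 5 is used exactly once.
    static int sequentialWithSeed() {
        return List.of(1, 2, 3, 4).stream().reduce(5, Integer::sum);
    }

    // Parallel with a non-identity seed: the seed may be used once per chunk,
    // so the result depends on how the work is split.
    static int parallelWithSeed() {
        return List.of(1, 2, 3, 4).parallelStream().reduce(5, Integer::sum);
    }

    // Parallel with a true identity: 0 can be added any number of times safely.
    static int parallelWithIdentity() {
        return 5 + List.of(1, 2, 3, 4).parallelStream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sequentialWithSeed());   // 15
        System.out.println(parallelWithSeed());     // varies with the machine
        System.out.println(parallelWithIdentity()); // 15
    }
}
```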