References:
- Refactoring Guru's content on the Builder pattern (though, like some of the previous patterns, their suggested structure involves some over-complication)
- "Builder Design Pattern" by Lokesh Gupta
- The Java Stream documentation
Streams and lambdas
- Refresher about lambdas: Function, Predicate, Consumer (a minimal sketch follows this list)
- The addition of Stream was one of the major features added to Java 8. Streams are wrappers around a data source, allowing us to operate with that data source and making bulk processing convenient and fast. A stream does not store data and, in that sense, is not a data structure. It also never modifies the underlying data source.
- This functionality supports functional-style operations on streams of elements, such as map-reduce transformations on collections. The operations can be composed into “stream pipelines”.
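As a quick refresher, here is a minimal sketch (my example, not from the notes) of those three functional interfaces, each implemented with a lambda. All three live in java.util.function:
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

Function<String, Integer> length = s -> s.length();    // takes a String, returns an Integer
Predicate<String> isLong = s -> s.length() > 10;       // takes a String, returns a boolean
Consumer<String> printer = s -> System.out.println(s); // takes a String, returns nothing

printer.accept("length: " + length.apply("hello")); // prints "length: 5"
printer.accept("isLong? " + isLong.test("hello"));  // prints "isLong? false"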
Creating streams
You can create empty streams, or streams from existing data sources like lists.
Stream<String> s1 = Stream.empty(); // empty stream
// Stream of strings
Stream<String> s2 = Stream.of("these", "are", "stream", "contents");
// Stream of strings
List<String> myList = List.of("This", "is", "a", "list", "of", "Strings");
Stream<String> s3 = myList.stream();
// Using the builder pattern
Stream<String> s4 = Stream.<String>builder()
.add("builder")
.add("pattern")
.add("in action")
.build();
Note that if the Stream comes from an existing data source, it does NOT modify that data source, no matter what operations are performed in the Stream.
Beyond simply creating a stream from pre-existing data, you can generate streams by doing other transformations on data:
Random random = new Random();
DoubleStream ds = random.doubles(3); // Stream of 3 random doubles in [0, 1)
IntStream intStream = IntStream.range(1, 3); // 1, 2 (end is exclusive)
LongStream longStream = LongStream.rangeClosed(1, 3); // 1, 2, 3 (end is inclusive)
Finally, you can also create streams out of file contents:
Files.lines(Path.of("file.txt"), Charset.defaultCharset())
.forEach(System.out::println);
This is in contrast to Files.readAllLines(Path.of("file.txt")), which would read all lines into a List<String>. This can be time and memory intensive. The Stream solution loads lines “lazily” and processes them one at a time.
Stream pipelines
In general, a stream pipeline consists of:
- A source, which can be an array, a collection, a generator function, an I/O channel, etc. We talked about this above.
- Zero or more intermediate operations, which transform the stream into another stream. Because these intermediate operations return streams themselves, they can be chained together to perform a number of operations.
- Exactly one terminal operation, which produces a result or a side effect. Since the terminal operation “exits” the pipeline, no further stream operations can be added to the pipeline.
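Putting those three pieces together, here is a small illustrative pipeline (my example, not from the notes) with each part labelled:
long count = List.of("a", "bb", "ccc").stream() // source: a collection
    .map(String::toUpperCase)                   // intermediate: returns another Stream
    .filter(s -> s.length() > 1)                // intermediate: returns another Stream
    .count();                                   // terminal: produces a result (2)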
Many stream operations take in a behavioural parameter (i.e., a function). This can be written inline as a lambda, referred to using a variable that points to a lambda, or using the method reference syntax (e.g., System.out::println).
These behavioural parameters represent functions that must be applied to each item in the stream.
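For example, here are three equivalent ways of handing the same behavioural parameter to forEach (an illustrative sketch):
List<String> words = List.of("stream", "lambda", "supplier");

// 1. An inline lambda
words.forEach(w -> System.out.println(w));

// 2. A variable that points to a lambda
Consumer<String> printer = w -> System.out.println(w);
words.forEach(printer);

// 3. The method reference syntax
words.forEach(System.out::println);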
Earlier in the quarter we talked about various “small patterns” we often perform with for loops in imperative languages, like transforming all items in a collection by applying some function to each item (map), or removing certain items in a collection based on some condition (filter), or summing up or aggregating in some way the values in a collection (reduce).
Streams allow us to define “pipelines” of these operations to be performed on collections.
For example, imagine that we have a giant file of strings, and we need to:
- Upper-case each line
- Keep only the lines that include the phrase “SECRET PHRASE”
If our data is in a file called “file.txt”, this would look like
Stream<String> result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
.map(String::toUpperCase)
.filter(l -> l.contains("SECRET PHRASE"));
You’ll notice that we still only have a Stream<String> after the above code runs.
That’s because all we’ve done is create a pipeline of operations to be run — we haven’t actually executed those operations yet.
The map and filter above are intermediate operations. They are not actually kicked off until a terminal operation is added to the stream pipeline.
As I mentioned earlier, a terminal operation produces some result or side effect, thereby exiting the stream pipeline. Some examples of terminal operations are:
- Collecting the result of the Stream into a list (.toList())
- Counting the elements left in the stream after the intermediate operations have been performed (.count())
- In the case of primitive streams like IntStream, DoubleStream, and LongStream, performing numerical aggregations (.sum(), .average(), etc.)
Here are some examples:
List<String> result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(String::toUpperCase)
    .filter(l -> l.contains("SECRET PHRASE"))
    .toList();
Notice that the type of result is now List<String>. It is no longer a stream to which we can add further computations.
You can specify that a map should mapToInt (i.e., map to an IntStream). That allows numeric stream operations like .max() and .sum():
OptionalInt result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
.map(String::toUpperCase)
.filter(l -> l.contains("SECRET PHRASE"))
.mapToInt(String::length)
.max();
PONDER: Why do you think we get an OptionalInt instead of a plain old int in return?
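Whatever your answer, here is one way you might unpack the OptionalInt from the example above (a sketch):
if (result.isPresent()) {
    System.out.println("Longest matching line: " + result.getAsInt());
} else {
    System.out.println("No lines contained the secret phrase");
}

// Or, more compactly, supply a fallback value:
int longest = result.orElse(0);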
Finally, you can terminate streams with “side effects”, i.e., functions that don’t return a value, but have some other effect (e.g., they change the value of some other variable, or they write to some output stream).
Files.lines(Path.of("file.txt"), Charset.defaultCharset())
.map(String::toUpperCase)
.filter(l -> l.contains("SECRET PHRASE"))
.forEach(System.out::println);
In the code above, we are applying the forEach terminal operation to the stream. In the terminal operation, we are passing each item to the System.out::println function that you know and love. Recall that the :: is the method reference syntax — we are “pointing to” the println function and saying “call this on each item in the stream”.
If lambdas are more your thing, you can write that as l -> System.out.println(l). But in general it’s better to use method references for lambdas that are this simple.
Stream pipelines are evaluated lazily
Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.
This has important implications. For example, consider the following pipeline, where I’ve added print statements in each operation.
OptionalInt result = Files.lines(Path.of("file.txt"), Charset.defaultCharset())
    .map(line -> {
        System.out.println("Upper-casing " + line);
        return line.toUpperCase();
    })
    .filter(line -> {
        System.out.println("\tChecking " + line + " for secret phrase");
        return line.contains("SECRET PHRASE");
    })
    .mapToInt(line -> {
        System.out.println("\t\tMapping " + line + " to charlength");
        return line.length();
    })
    .max();
Can you predict what the printed output would be with the following input?
INPUT
here
are
sOME
LINes SEcret phrASE
in
A
File
OUTPUT
Upper-casing here
    Checking HERE for secret phrase
Upper-casing are
    Checking ARE for secret phrase
Upper-casing sOME
    Checking SOME for secret phrase
Upper-casing LINes SEcret phrASE
    Checking LINES SECRET PHRASE for secret phrase
        Mapping LINES SECRET PHRASE to charlength
Upper-casing in
    Checking IN for secret phrase
Upper-casing A
    Checking A for secret phrase
Upper-casing File
    Checking FILE for secret phrase
The mapToInt step only applied to one item—the one that survived the previous filtering step.
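Laziness also means a pipeline can stop pulling elements from the source early. In this sketch (mine, not from the notes), findFirst is a short-circuiting terminal operation, so elements after the first match are never examined:
Stream.of("here", "are", "SECRET PHRASE", "more", "lines")
    .filter(line -> {
        System.out.println("Checking " + line);
        return line.contains("SECRET PHRASE");
    })
    .findFirst();
// Prints "Checking" for only the first three elements; "more" and
// "lines" are never consumed from the source.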
Rules for behavioural parameters
All behavioural parameters to streams must:
- Be non-interfering: While a stream pipeline is executing (i.e., its terminal operation has been invoked), its data source must not be modified. This is similar to how you will get a ConcurrentModificationException if you modify a collection while using a for-each loop to iterate over it (see the sketch after this list).
- Be stateless: A stateful lambda or function is one whose result depends on any state (e.g., instance variables in a class) that might change during execution of the stream pipeline.
- Not have side effects: Recall that stream operations are lazily applied.
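Here is a sketch of what interference looks like; on typical JDKs this is likely to throw a ConcurrentModificationException when the terminal operation runs:
List<String> source = new ArrayList<>(List.of("a", "b", "c"));
source.stream()
    .filter(s -> {
        source.add("d"); // DON'T: modifies the stream's own data source
        return true;
    })
    .forEach(System.out::println);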
From the Stream documentation:
A stream implementation is permitted significant latitude in optimizing the computation of the result. For example, a stream implementation is free to elide operations (or entire stages) from a stream pipeline – and therefore elide invocation of behavioral parameters – if it can prove that it would not affect the result of the computation. This means that side-effects of behavioral parameters may not always be executed and should not be relied upon, unless otherwise specified (such as by the terminal operations forEach and forEachOrdered).
In short, you cannot rely on all stream operations always being executed.
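As a concrete illustration (assuming a recent JDK, 9 or later): because a List knows its own size, count() may compute its result directly from the source, skipping traversal entirely, in which case the peek below never prints anything:
long n = List.of("a", "b", "c").stream()
    .peek(s -> System.out.println("Visiting " + s))
    .count();
System.out.println(n); // 3, possibly with no "Visiting" lines printed at all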
Parallel streams
As mentioned above, a Stream doesn’t kick off until a terminal operation is called. So until that happens, the Stream is still being built up (using the Builder pattern), and all the intermediate operations like map and filter are being added to it.
At this point the “Stream” is usually a “sequential stream”. That is, it processes the data in one thread. However, modern computers usually have multiple cores available, meaning they can perform several actions at once. This means that some computations can be sped up if we can split up the problem (or the input data) into subsets, process those subsets, and combine the results.
You can do this by turning the Stream into a Parallel Stream.
Any stream can be told to operate in parallel by calling parallel() on it. parallel() is an intermediate operation. (Or, if your stream’s data source is a data structure like a list, you can call parallelStream() on it instead of stream() to begin streaming.)
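Both options look like this in a sketch (my example):
List<Integer> nums = List.of(1, 2, 3, 4, 5, 6, 7, 8);

// Option 1: make an existing stream parallel with the parallel() intermediate operation
long evens = nums.stream()
    .parallel()
    .filter(n -> n % 2 == 0)
    .count();

// Option 2: start streaming in parallel directly from the collection
long alsoEvens = nums.parallelStream()
    .filter(n -> n % 2 == 0)
    .count();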
While this can result in a significant speedup, there are some important things to be aware of:
- By default, the parallel stream will use one fewer than the number of available cores. This may be fine for demonstrating parallelism, but in the real world, you often want more control over how much of the computer’s resources will be devoted to a task. For example, your code may be invoked programmatically by some other module that is itself working on subproblems in parallel and needs the ability to orchestrate thread management. If your parallel stream kicks off a number of long-running tasks, you will soon effectively block all available threads.
- Parallel processing may actually be slower than sequential processing in some cases. If you’re not working on long-enough running tasks, or not working with enough data, then the overhead of splitting the task and combining results may outweigh any benefits of parallel processing.
- Finally, there can be some “gotchas” while working with parallel streams. Consider the following reduce operation, which uses 5 as the initial value instead of the identity value 0.
int result = List.of(1, 2, 3, 4)
    .parallelStream()
    .reduce(5, Integer::sum);
In normal sequential application, we would get the result 5 + 1 + 2 + 3 + 4 = 15. However, in a parallel stream, the reduce is given to each thread to handle, and 5 is added in each thread. Depending on how many threads are dedicated to this task, we will get different results.
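One way to avoid this gotcha (a sketch): pass a true identity element (0 for addition) to reduce, and apply the extra value after the stream has been reduced:
int sum = List.of(1, 2, 3, 4)
    .parallelStream()
    .reduce(0, Integer::sum) + 5; // always 15, regardless of thread count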