Overview
Having learned about lambdas in the previous lesson, we will learn about a related construct in Java called streams. Lambdas and streams are often used together.
Streams allow us to take a series of computations we mean to perform on a collection of data, and compose them into a “pipeline”.
In this lesson, we’ll start with concrete examples of using the Streams API, since on the surface there is very little new or unfamiliar happening here. Following this, there is a brief discussion about what exactly is meant by “streaming”, and some of the underlying properties of streams in Java that are important to know about.
An example problem
Before we start, let’s recall the map and filter patterns that we talked about in the previous lesson.
- In the map pattern, we define a function that describes a computation that we want to perform on each item in a collection. The map applies that function to each item in the list and returns a new list containing the results (i.e., the result of applying the function to each item in the original list).
- In the filter pattern, we define a predicate that describes a condition we want to check for each item in a collection. The filter tests each item against that condition, and returns a list containing the items that “pass” or “satisfy” the predicate.
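As a quick reminder, the two patterns can be sketched with plain loops. This is only an illustration, not code from the previous lesson: the class name PatternsSketch and the helper names map and filter are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class PatternsSketch {
    // Map pattern: apply a function to every item and collect the results in a new list.
    static <T, R> List<R> map(List<T> items, Function<T, R> function) {
        List<R> results = new ArrayList<>();
        for (T item : items) {
            results.add(function.apply(item));
        }
        return results;
    }

    // Filter pattern: keep only the items that satisfy the predicate.
    static <T> List<T> filter(List<T> items, Predicate<T> predicate) {
        List<T> results = new ArrayList<>();
        for (T item : items) {
            if (predicate.test(item)) {
                results.add(item);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> years = List.of(1995, 2003, 2010);
        List<Integer> nextYears = map(years, y -> y + 1);
        List<Integer> recentYears = filter(years, y -> y > 2000);
        System.out.println(nextYears);   // [1996, 2004, 2011]
        System.out.println(recentYears); // [2003, 2010]
    }
}
```

Neither helper changes the list it is given; each one builds and returns a new list.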
Continuing with our examples of Album objects, let’s suppose we have a list of Albums that we are working with.
For the purposes of this example, let’s assume Album objects have the following fields:
String title
String artist
int year
long unitsSold
double price
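The code in this lesson calls getters such as getYear, getUnitsSold, and getPrice, plus a setPrice setter. A minimal sketch of an Album class consistent with those calls might look like the following. The constructor and its parameter order are assumptions, since the lesson does not show the class itself.

```java
class Album {
    private final String title;
    private final String artist;
    private final int year;
    private final long unitsSold;
    private double price; // not final, because a later example changes prices

    // Assumed constructor: one parameter per field, in declaration order.
    Album(String title, String artist, int year, long unitsSold, double price) {
        this.title = title;
        this.artist = artist;
        this.year = year;
        this.unitsSold = unitsSold;
        this.price = price;
    }

    String getTitle() { return title; }
    String getArtist() { return artist; }
    int getYear() { return year; }
    long getUnitsSold() { return unitsSold; }
    double getPrice() { return price; }

    void setPrice(double price) { this.price = price; }
}
```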
We are given the following problem prompt:
Write a program that consumes a list of Album objects and, for the Albums released after the year 2000, computes the average number of units they have sold.
Let’s consider two solutions to this problem: one using regular for-each loops, like we are used to, and one using streams and lambdas.
A for-each loop solution
As you read the code below, try to identify usage of the map or filter patterns.
public static double averageSalesAfter2000(List<Album> albums) {
    long sum = 0;
    int albumsAfter2000 = 0;
    for (Album current : albums) {
        if (current.getYear() > 2000) {
            long sales = current.getUnitsSold();
            sum = sum + sales;
            albumsAfter2000 = albumsAfter2000 + 1;
        }
    }
    if (albumsAfter2000 > 0) {
        // Cast to double so the fractional part isn't lost to integer division.
        return (double) sum / albumsAfter2000;
    } else {
        return 0;
    }
}
A streams solution
The same problem can be solved using streams. The code is below, and an explanation of each line follows.
public static double averageSalesAfter2000(List<Album> albums) {
    OptionalDouble result = albums.stream()   // Stream<Album>
        .filter(a -> a.getYear() > 2000)      // Stream<Album>
        .mapToLong(a -> a.getUnitsSold())     // LongStream
        .average();                           // OptionalDouble
    // After filtering, there may not be any albums left.
    // In that case, we just return 0.
    return result.orElse(0);
}
In the code above, we have organised a series of computations into a stream pipeline.
- We first call stream on the list of albums, to turn it into a stream of albums (Stream<Album>). This step is necessary to be able to call the other stream operations.
- Then, we use filter to filter down to albums released after the year 2000. We define the condition as a lambda (a Predicate), passed as a parameter to filter.
- Then, we use mapToLong to go from a collection of Album objects to a collection of long values. We could’ve used map here instead of mapToLong, but using mapToLong means that we get back a stream whose static type is LongStream. This means we have access to a number of useful numerical operations, like average.
- The LongStream provides an average method, which we can use to compute the average of the items remaining in the stream. This gives us an OptionalDouble in return. OptionalDouble is a class in Java representing a double which may or may not exist.
  - The reason this double may not exist is that, if the list is empty after filtering, we can’t compute an average, because you can’t divide by 0.
- We get the computed average from the OptionalDouble object and return the value.
  - The orElse method on the OptionalDouble gets us the computed value if it exists, or it gives us a specified “backup” value otherwise.
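To see the “no average” case in isolation, here is a small sketch that uses a plain list of Long values instead of Album objects. The class name OptionalAverageSketch and the method averageAbove are hypothetical names chosen for this illustration.

```java
import java.util.List;
import java.util.OptionalDouble;

class OptionalAverageSketch {
    // Averages the values greater than the threshold, or returns 0 if there are none.
    static double averageAbove(List<Long> values, long threshold) {
        OptionalDouble result = values.stream()   // Stream<Long>
                .filter(v -> v > threshold)       // Stream<Long>
                .mapToLong(v -> v)                // LongStream (each Long is unboxed)
                .average();                       // OptionalDouble, empty if nothing is left
        return result.orElse(0);
    }

    public static void main(String[] args) {
        List<Long> unitsSold = List.of(100L, 250L, 400L);
        System.out.println(averageAbove(unitsSold, 200));  // 325.0
        System.out.println(averageAbove(unitsSold, 1000)); // 0.0
    }
}
```

In the second call nothing passes the filter, so average() gives back an empty OptionalDouble and orElse supplies the backup value.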
The Streams API provides a whole host of operations that can be performed on streams of data.
filter, map, and specialised maps like mapToLong are just the tip of the iceberg.
You can explore the API at the Stream JavaDoc page.
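As a taste of what else is available, the sketch below chains a few other intermediate operations (distinct, sorted, and limit) on a list of artist names. The class and method names are invented for this example, and toList() on a stream assumes Java 16 or newer.

```java
import java.util.List;

class MoreOperationsSketch {
    // Returns up to `limit` distinct artist names, upper-cased and sorted alphabetically.
    static List<String> firstDistinctArtists(List<String> artists, int limit) {
        return artists.stream()
                .map(name -> name.toUpperCase()) // normalise case so duplicates match
                .distinct()                      // drop repeated names
                .sorted()                        // natural (alphabetical) order
                .limit(limit)                    // keep at most `limit` items
                .toList();
    }

    public static void main(String[] args) {
        List<String> artists = List.of("Queen", "Adele", "queen", "Beyonce");
        System.out.println(firstDistinctArtists(artists, 2)); // [ADELE, BEYONCE]
    }
}
```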
Streams are not data structures
A stream, by itself, does not store data, and is technically not a data structure. Streams are wrappers around a data source. They allow us to define a series of operations that should be performed on that data source, and they make bulk processing of data convenient and fast.
The “data source” for a stream can be anything—an array or list, a file stored on disk, a stream of data coming from some external service, etc. In this class, we will only deal with streams based on lists or arrays, but this section describes how Streams might be used to work with other types of data sources.
A stream never modifies its underlying data source.
For example, you cannot use stream operations on a list to remove items from or add items to the list, just like you can’t add or remove items from a list while looping over it using a for-each loop.
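A short sketch can make this concrete: filtering a stream produces a new result and leaves the original list as it was. The class and method names here are hypothetical, and a list of years stands in for a list of albums.

```java
import java.util.ArrayList;
import java.util.List;

class SourceUnchangedSketch {
    // Returns a new list holding only the years after 2000; `years` itself is not touched.
    static List<Integer> after2000(List<Integer> years) {
        return years.stream()
                .filter(y -> y > 2000)
                .toList();
    }

    public static void main(String[] args) {
        List<Integer> years = new ArrayList<>(List.of(1995, 2003, 2010));
        List<Integer> recent = after2000(years);
        System.out.println(recent); // [2003, 2010]
        System.out.println(years);  // [1995, 2003, 2010]

        // To actually change the list, call a method on the list itself.
        years.removeIf(y -> y <= 2000);
        System.out.println(years);  // [2003, 2010]
    }
}
```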
A stream pipeline usually consists of 3 pieces:
- A data source, which can be an array, a list, a file, etc.
- Zero or more intermediate operations, each of which transforms the stream into another stream. Because these intermediate operations return streams themselves, they can be chained together to perform a number of operations.
- Exactly one terminal operation, which produces a result or a side effect. Since the terminal operation “exits” the pipeline, no further stream operations can be added to the pipeline. That is, the terminal operation is always the last operation in a stream pipeline.
In our example above,
- albums was the source of the stream
- filter and mapToLong were intermediate operations
- average was a terminal operation
Stream pipelines are lazy. A stream pipeline will not begin executing until it has to. Specifically, the stream processing won’t be “kicked off” until a terminal operation is called.
For example, if we had only called filter and mapToLong above, we would still be left with a LongStream, i.e., a stream of longs.
No processing would take place unless some terminal operation was added to the pipeline.
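One way to observe this laziness is to count how many times the filter’s predicate runs before and after a terminal operation is called. The sketch below does that with an AtomicInteger counter; the class and method names are made up for the illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazySketch {
    // Returns {checks made before the terminal operation, checks made after it}.
    static int[] checksBeforeAndAfter(List<Integer> years) {
        AtomicInteger checks = new AtomicInteger();

        // Only intermediate operations so far: the pipeline is described, not run.
        Stream<Integer> pipeline = years.stream()
                .filter(y -> {
                    checks.incrementAndGet(); // record that the predicate was evaluated
                    return y > 2000;
                });
        int before = checks.get();

        // The terminal operation kicks off the processing.
        List<Integer> recent = pipeline.toList();
        int after = checks.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = checksBeforeAndAfter(List.of(1995, 2003, 2010));
        System.out.println(counts[0]); // 0
        System.out.println(counts[1]); // 3
    }
}
```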
Some examples of terminal operations are:
- Collecting the result of the stream pipeline into a list (.toList()).
  List<Double> albumCosts = albums.stream()
      .filter(a -> a.getYear() > 2000)
      .map(a -> a.getPrice())
      .toList();
- Counting the elements left in the stream after the intermediate operations have been performed (.count()). Note that count returns a long.
  long albumsBefore2000 = albums.stream()
      .filter(a -> a.getYear() < 2000)
      .count();
- Looping over the elements in the stream and operating on them, i.e., applying a Consumer to each item (.forEach(Consumer)).
  // Reduce cost of pre-2000 albums by 10%
  albums.stream()
      .filter(a -> a.getYear() < 2000)
      .forEach(a -> a.setPrice(a.getPrice() * 0.9));
- Finally, as we’ve seen above, you can perform numerical aggregations (like .sum() or .average()) when you have primitive streams like IntStream, DoubleStream, or LongStream.
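The last kind of terminal operation in the list above can be sketched as follows, using plain lists of prices and years rather than Album objects. The class and method names are hypothetical.

```java
import java.util.List;

class AggregationSketch {
    // Adds up all prices using a DoubleStream.
    static double total(List<Double> prices) {
        return prices.stream()
                .mapToDouble(p -> p) // Stream<Double> -> DoubleStream
                .sum();
    }

    // Finds the latest year using an IntStream, or 0 if the list is empty.
    static int newest(List<Integer> years) {
        return years.stream()
                .mapToInt(y -> y)    // Stream<Integer> -> IntStream
                .max()               // OptionalInt, empty for an empty stream
                .orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(total(List.of(9.5, 12.0, 8.5)));  // 30.0
        System.out.println(newest(List.of(1995, 2003, 2010))); // 2010
    }
}
```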
What is “streaming”?
You likely already know the meaning of the word “streaming”. For example, you’ve heard of “streaming music” or “streaming a video” over the internet. To simplify it greatly, it means to process data while it loads, rather than to load all the data before beginning to process it.
For example, when you’re streaming a movie on Netflix, you’re not actually downloading the whole movie to your machine and then watching it. Rather, chunks of the movie are being sent to your computer and played in your browser as they arrive.
Streams in Java are a similar idea.
This can be a useful mode of operation when you are working with huge amounts of data that cannot all be loaded into memory at once, or if you are working with “never-ending data”, for example, minute-by-minute readings from weather sensors.
In these situations, you cannot wait to load all the data into, say, an ArrayList before you begin processing the data.
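To get a feel for “never-ending data”, the sketch below uses Stream.iterate to stand in for a sensor feed. This is an assumption made purely for illustration: the lesson does not show a real sensor API, and the class and method names are invented. The source has no end, yet the pipeline finishes because limit stops asking for items once it has enough.

```java
import java.util.List;
import java.util.stream.Stream;

class EndlessSketch {
    // Pretend sensor: a temperature that rises by 0.5 degrees with every reading, forever.
    static Stream<Double> readings(double start) {
        return Stream.iterate(start, t -> t + 0.5);
    }

    // Collects the first `howMany` readings above the threshold.
    static List<Double> firstReadingsAbove(double start, double threshold, int howMany) {
        return readings(start)
                .filter(t -> t > threshold)
                .limit(howMany) // without this, the pipeline would never finish
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(firstReadingsAbove(20.0, 21.0, 3)); // [21.5, 22.0, 22.5]
    }
}
```

There is no list holding “all the readings” anywhere in this code; items are produced one at a time as the pipeline asks for them.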
Consider the following scenario.
Let’s imagine you need to read and process data from a HUGE file on your hard disk: MyGiantFile.txt. The file is too large for you to read the entire thing into a list of strings.
One way you could do this is to use a Scanner to read the file and process it line by line, like we have done in a project and a couple of labs this term.
Scanner scanner = new Scanner(new File("MyGiantFile.txt"));
// hasNextLine pairs with nextLine; hasNext only checks for another token.
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    // Assume we do some work with the line here
}
With the Streams API, we can now concisely define operations like the above using lambdas and all the benefits they bring.
The Files.lines static method creates a stream of strings, allowing us to define a pipeline of operations that will apply to each line in MyGiantFile.txt.
Files.lines(Path.of("MyGiantFile.txt"))
    .map(line -> .....)
    .filter(line -> ........)
    .forEach(line -> .......);
Because Files.lines returns a Stream<String>, the lines in the file are streamed through our pipeline, but this detail is abstracted away from you, the developer.
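Here is one runnable sketch of such a pipeline. The task (counting non-blank lines), the class name, and the temporary file used in main are all made up for the illustration. The stream is opened in a try-with-resources block so that the underlying file is closed when the pipeline finishes, which the Files.lines documentation recommends.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class FileLinesSketch {
    // Counts the non-blank lines in a file without first loading every line into a list.
    static long countNonBlankLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .map(line -> line.strip())       // trim surrounding whitespace
                    .filter(line -> !line.isEmpty()) // drop blank lines
                    .count();                        // terminal operation
        }
    }

    public static void main(String[] args) throws IOException {
        // A small temporary file stands in for MyGiantFile.txt.
        Path file = Files.createTempFile("albums", ".txt");
        Files.write(file, List.of("Thriller", "", "   ", "Back in Black"));
        System.out.println(countNonBlankLines(file)); // 2
        Files.delete(file);
    }
}
```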
If you use a simple collection in memory (like an array or list) as the source of a stream, you’re not gaining much in the way of “streaming”. In that situation, the Streams API mostly provides a convenient library and syntax for performing operations on a collection of data. Still pretty good!