Fundamentals of Stream Processing by Andrade, Gedik, Turaga.

Created on 2021-02-11T18:12:12-06:00


This card can also be read via Gemini.

Events have attributes.

Streams are a way to receive or send events.

Operators modify or redirect events coming across streams.

Stream processing is the totality of operators from the input to the output of the system.

Stream processing languages define the logical means by which a stream of events is routed to operators/kernels.

Stream processing runtimes perform the actual modifications, realizing streams across multiple threads or computing devices.

Definitions

Continuous Query: has a query, a trigger, and a stop condition. Monitors data coming across a feed and acts when an event of interest passes by.

Publish-Subscribe: publishes a message to all subscribers of a topic.

ETL: Extract, Transform, Load

SCADA: Supervisory Control and Data Acquisition

Chinese wall: an information barrier between two internal parties, ensuring that information does not leak and that decisions are not made based on privileged information.

Windowing: operating on a small "window" of elements, ex. the current element and the last five seen, or each pair of elements, or a triplet of the previous, current, and next elements.
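As a minimal sketch (plain Python, not the book's SPL), a sliding window that pairs each element with the ones just before it:

```python
from collections import deque

def sliding_windows(events, size):
    """Yield a tuple of the last `size` events each time a new event
    arrives, starting once the window has filled."""
    window = deque(maxlen=size)  # oldest element drops off automatically
    for event in events:
        window.append(event)
        if len(window) == size:
            yield tuple(window)

# Each window holds the current element plus the two seen before it.
print(list(sliding_windows([1, 2, 3, 4, 5], 3)))
```

The `deque(maxlen=...)` does the eviction for us; a stream runtime would do the same bookkeeping per operator instance.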

Stream connections

Direct: a line of output events goes to a line of input events

Fan-in: multiple lines of output events go to one line of input events

Fan-out: one line of input events goes to multiple lines of output events
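A rough Python sketch of the three connection shapes (names like `fan_in`/`fan_out` are my own, not the book's):

```python
import itertools

def fan_in(*streams):
    """Merge several output lines into one input line.
    This round-robins for simplicity; a real runtime merges by arrival."""
    sentinel = object()
    for group in itertools.zip_longest(*streams, fillvalue=sentinel):
        for event in group:
            if event is not sentinel:
                yield event

def fan_out(stream, n):
    """Duplicate one line of events onto n independent lines."""
    return itertools.tee(stream, n)

merged = list(fan_in(iter([1, 3]), iter([2, 4])))   # one consumer, two producers
a, b = fan_out(iter(["x", "y"]), 2)                 # two consumers, one producer
```

A direct connection is just `consumer(producer())` with no merging or duplication in between.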

Stream operators

Edge adapters: consume outside data streams and provide inside data streams (and vice versa at the output edge)

Aggregator: consumes one or more streams and outputs a collated version of the stream; for example, events with random incoming timing become a stream of average "things" per second.

Splitting: consumes a stream and produces multiple different streams; ex. forwards messages to the proper departments

Merging: consumes multiple related streams; combines the related data points and produces a new stream

Mapping: algorithmic, logical changes to data on the stream.

Sequencing: re-order or delay events in the stream
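The operators above compose into a pipeline. A toy sketch with Python generators (the event shape and the doubling step are invented for illustration):

```python
def edge_adapter(raw_lines):
    """Edge adapter: turn outside data (CSV text) into inside events."""
    for line in raw_lines:
        name, value = line.split(",")
        yield {"name": name, "value": float(value)}

def mapper(events):
    """Mapping: an algorithmic change applied to each event."""
    for e in events:
        yield {**e, "value": e["value"] * 2}

def aggregator(events):
    """Aggregator: collate the whole stream into one summary event."""
    values = [e["value"] for e in events]
    return {"count": len(values), "avg": sum(values) / len(values)}

# Compose: adapter -> mapper -> aggregator.
result = aggregator(mapper(edge_adapter(["a,1.0", "b,3.0"])))
```

Splitting and merging would slot in the same way, taking one generator and returning several, or the reverse.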

Special flow adapters in SPL

Deduplicate: stores the set of keys it has witnessed and drops events whose key/value has been seen before.

Filter: rejects/keeps events matching a condition

Aggregators: max, min, avg, ..

Punctuation: a special non-data marker inserted into the stream to signal something to downstream operators; window and final punctuations indicate, for example, the end of an incoming batch or of a group to be aggregated. A final punctuation means the stream is out of data and closing.

Tumbling Window: fills up, sends its entire contents to the processor on some trigger condition, and clears

Sliding Window: fills continuously and sends its contents to the processor on some trigger condition; each new element pushes the others back, and elements that fall off the end are dropped

Partitioning: separating to an output bucket based on some attribute; can be stacked with windows so ex. items are sorted to windows by type and then process a whole batch of items of the same type

Encapsulation: when a stream has to stop and run another subordinate stream processor to obtain its own output

not in the book, but relevant -- Debounce: similar to deduplication, but removes similar events that occur too close together in time
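Deduplicate and debounce differ only in what they remember: a key set versus a timestamp. A sketch of both (plain Python, not SPL; the `(timestamp, payload)` event shape is an assumption):

```python
def deduplicate(events, key):
    """Drop events whose key has already been seen."""
    seen = set()
    for e in events:
        k = key(e)
        if k not in seen:
            seen.add(k)
            yield e

def debounce(events, min_gap):
    """Drop events arriving within `min_gap` seconds of the last kept one.
    Events are (timestamp, payload) pairs in timestamp order."""
    last = None
    for ts, payload in events:
        if last is None or ts - last >= min_gap:
            last = ts
            yield (ts, payload)

deduped = list(deduplicate([1, 2, 1, 3, 2], key=lambda e: e))
bounced = list(debounce([(0.0, "a"), (0.1, "b"), (1.0, "c")], 0.5))
```

Note the dedup key set grows without bound; real operators cap it with a window or an expiry.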

Window policies

Count: windows flush after so many events enter them

Delta: an attribute of the oldest entry in the window is recorded; items accumulate until the same attribute of the newest item exceeds the oldest's by some amount

Time: window is flushed every time so much time has passed

Punctuation: windows flush on an explicit punctuation command
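The count and delta policies can be sketched in a few lines of Python (my own sketch; `attr` here is a dict key standing in for an event attribute):

```python
def count_windows(events, n):
    """Count policy: flush after n events have entered the window."""
    window = []
    for e in events:
        window.append(e)
        if len(window) == n:
            yield window
            window = []

def delta_windows(events, attr, delta):
    """Delta policy: accumulate until the newest event's attribute
    exceeds the oldest's by more than `delta`, then flush."""
    window = []
    for e in events:
        if window and e[attr] - window[0][attr] > delta:
            yield window
            window = []
        window.append(e)
    if window:  # flush whatever remains when the stream closes
        yield window
```

A time policy is a delta policy on the timestamp attribute; a punctuation policy would flush wherever a punctuation marker appears in the stream instead.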