Fundamentals of Stream Processing by Andrade, Gedik, Turaga.
Created on 2021-02-11T18:12:12-06:00
Events have attributes.
Streams are a way to receive or send events.
Operators modify or redirect events coming across streams.
Stream processing is the totality of operators from the input to the output of the system.
Stream processing languages define the logical means by which a stream of events is routed to operators/kernels.
Stream processing runtimes perform the actual modifications, realizing streams across multiple threads or computing devices.
Definitions
Continuous Query: has a query, trigger and stop condition. Monitors data that comes across a feed and does something when an event of interest passes by.
Publish-Subscribe: publishes a message to all subscribers of a topic.
ETL: Extract, Transform, Load
SCADA: Supervisory Control and Data Acquisition
Chinese wall: an information barrier between two internal parties to ensure confidential information does not leak and decisions are not made based on the other side's secrets.
Windowing: operating on a small "window" of elements, ex. the current element and the last five seen, or each pair of elements, or a triplet of the current past and next elements.
Stream connections
Direct: a line of output events goes to a line of input events
Fan-in: multiple lines of output events go to one line of input events
Fan-out: one line of input events goes to multiple lines of output events
Stream operators
Edge adapters: consumes outside data streams; provides inside data stream
Aggregator: consumes one or more streams and outputs a collated version of the stream; for example, events with random incoming timing become a stream of average "things" per second.
Splitting: consumes a stream and produces multiple different streams; ex. forwards messages to the proper departments
Merging: consumes multiple related streams; combines the related data points and produces a new stream
Mapping: algorithmic, logical changes to data on the stream.
Sequencing: re-order or delay events in the stream
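The operator kinds above can be sketched as composable generators. This is an illustrative Python sketch (the book's own examples use SPL, not Python); the function names `mapping`, `merging`, and `splitting` just mirror the notes above and are not from any library.

```python
import itertools

# A stream is any iterable of events; each operator consumes one or more
# streams and yields (or returns) another, so operators compose.

def mapping(stream, fn):
    """Mapping: apply an algorithmic change to each event."""
    for event in stream:
        yield fn(event)

def merging(*streams):
    """Merging: combine several related streams (here, simple concatenation)."""
    yield from itertools.chain(*streams)

def splitting(stream, route):
    """Splitting: bucket events into separate outputs by route(event)."""
    outputs = {}
    for event in stream:
        outputs.setdefault(route(event), []).append(event)
    return outputs

doubled = mapping([1, 2, 3, 4], lambda e: e * 2)
print(splitting(doubled, lambda e: e % 4 == 0))
# {False: [2, 6], True: [4, 8]}
```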
Special flow adapters in SPL
Deduplicate: stores a list of keys it has witnessed, drops events if they contain the same key/value previously seen.
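A minimal Python sketch of the Deduplicate behavior described above (the `key` parameter is an assumed name, not SPL syntax): remember keys already witnessed and drop repeats.

```python
def deduplicate(stream, key):
    """Drop any event whose key has been seen before."""
    seen = set()
    for event in stream:
        k = key(event)
        if k not in seen:
            seen.add(k)
            yield event

events = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
print(list(deduplicate(events, key=lambda e: e["id"])))
# the second id=1 event is dropped
```

Note the `seen` set grows without bound; real implementations cap it (e.g. by count or age).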
Filter: rejects/keeps events matching a condition
Aggregators: max, min, avg, ..
Punctuation: a special non-data marker inserted into the stream; comes in window and final varieties. A window punctuation indicates the end of an incoming batch, a group to be aggregated, etc.; a final punctuation means the stream is out of data and closing.
Tumbling Window: fills up, sends entire pile to processor on some trigger condition and clears
Sliding Window: fills up continuously, sends pile to processor on some trigger condition; pushes every element back and drops elements that fall off the end
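The tumbling/sliding distinction, sketched in Python with a count-based trigger (an assumed illustration, not the book's SPL):

```python
from collections import deque

def tumbling(stream, size):
    """Tumbling: fill up to `size`, emit the whole pile, then clear."""
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield list(window)
            window.clear()

def sliding(stream, size):
    """Sliding: once full, emit on every event; the oldest falls off the end."""
    window = deque(maxlen=size)
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield list(window)

print(list(tumbling(range(6), 3)))  # [[0, 1, 2], [3, 4, 5]]
print(list(sliding(range(6), 3)))   # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```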
Partitioning: separating to an output bucket based on some attribute; can be stacked with windows so ex. items are sorted to windows by type and then process a whole batch of items of the same type
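Partitioning stacked with a tumbling window, sketched in Python (assumed names; each key gets its own window that fills and flushes independently):

```python
def partitioned_tumbling(stream, key, size):
    """Route events to a per-key window; flush a window when it reaches `size`."""
    windows = {}
    for event in stream:
        w = windows.setdefault(key(event), [])
        w.append(event)
        if len(w) == size:
            yield key(event), list(w)
            w.clear()

events = ["a1", "b1", "a2", "a3", "b2", "b3"]
print(list(partitioned_tumbling(events, key=lambda e: e[0], size=3)))
# [('a', ['a1', 'a2', 'a3']), ('b', ['b1', 'b2', 'b3'])]
```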
Encapsulation: when a stream has to stop and run another subordinate stream processor to obtain its own output
not in the book, but relevant -- Debounce: kind of like deduplication but removes similar events that happen too close to one another in time
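Since Debounce isn't in the book, here is one possible Python sketch of the idea (the `(timestamp, event)` tuple format and `interval` parameter are assumptions): drop an event if a similar one was emitted too recently.

```python
def debounce(stream, key, interval):
    """Drop an event if a same-key event was emitted within `interval` time units."""
    last_emitted = {}  # key -> timestamp of the last event let through
    for timestamp, event in stream:
        k = key(event)
        if k not in last_emitted or timestamp - last_emitted[k] >= interval:
            last_emitted[k] = timestamp
            yield timestamp, event

events = [(0.0, "click"), (0.1, "click"), (0.5, "click"), (1.2, "click")]
print(list(debounce(events, key=lambda e: e, interval=1.0)))
# [(0.0, 'click'), (1.2, 'click')]
```

Unlike deduplication, the 0.5s "click" is dropped even though earlier clicks were already emitted; only time proximity matters.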
Window policies
Count: windows flush after so many events enter them
Delta: some attribute of the oldest entry in the window is stored; items accumulate until the newest item's value of that same attribute exceeds the oldest's by some amount
Time: window is flushed every time so much time has passed
Punctuation: windows flush on an explicit punctuation command
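The delta policy is the least obvious of the four; a Python sketch under assumed names (`attr` selects the compared attribute, `delta` is the threshold):

```python
def delta_window(stream, attr, delta):
    """Flush when the newest event's attribute exceeds the oldest's by > delta."""
    window = []
    for event in stream:
        if window and event[attr] - window[0][attr] > delta:
            yield list(window)
            window = []
        window.append(event)
    if window:
        yield window  # flush the remainder at end of stream

ticks = [{"t": 0}, {"t": 2}, {"t": 3}, {"t": 7}, {"t": 8}]
print(list(delta_window(ticks, attr="t", delta=3)))
# [[{'t': 0}, {'t': 2}, {'t': 3}], [{'t': 7}, {'t': 8}]]
```

With `attr` as a timestamp this behaves like a time policy measured in event time; with another attribute (e.g. a price) it flushes on cumulative change instead.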