Dataflow streaming - does it fit our use case?
We've been using Dataflow in batch mode for a while now. However, we can't seem to find much information on streaming mode.
We have the following use case:
- Data/events are being streamed in real time into BigQuery.
- We need to transform/clean/denormalize that data before the business can analyze it.
Now, we could of course use Dataflow in batch mode and take chunks of data out of BigQuery (based on timestamps), then transform/clean/denormalize them that way.
But that's a bit of a messy approach, because the data is being streamed in real time and it gets really gnarly working out which data still needs to be worked on. It sounds brittle too.
It would be great if we could transform/clean/denormalize in Dataflow and write to BigQuery as the data is streaming in.
Is this what Dataflow streaming is intended for? If so, what data sources can Dataflow read from in streaming mode?
Yes, this is a reasonable use case for streaming mode. We support reading from Cloud Pub/Sub via the PubsubIO source, and additional sources are in the works. Output can be written to BigQuery via the BigQueryIO sink. The PCollection docs cover the distinction between bounded and unbounded sources/sinks, as well as the available concrete implementations.
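To make that concrete, here is a minimal sketch of a streaming pipeline that reads from Pub/Sub, cleans each event, and streams the result into BigQuery. It uses the current Apache Beam Java SDK (the open-source successor of the Dataflow SDK referred to in this answer); the project, topic, table, schema, and the CleanEvent transform are all placeholders you would replace with your own.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class StreamingCleanupPipeline {

  // Placeholder transform: parse/clean/denormalize one incoming event.
  static class CleanEvent extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // Your real cleaning/denormalization logic goes here.
      c.output(new TableRow().set("raw", c.element()));
    }
  }

  public static void main(String[] args) {
    // Run with e.g. --runner=DataflowRunner --streaming=true plus project/region/temp location.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("raw").setType("STRING")));

    p.apply("ReadEvents", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/events"))    // placeholder topic
     .apply("CleanAndDenormalize", ParDo.of(new CleanEvent()))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:analytics.events_clean")            // placeholder table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Because the Pub/Sub source is unbounded, the pipeline keeps running and results appear in BigQuery continuously, which avoids the timestamp-based chunking described in the question.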
As for the apparent lack of streaming-specific documentation: the majority of the unified model applies to both batch and streaming, so there is no streaming-specific section. That said, I'd recommend looking over the windowing and triggers sections of the PCollection docs, which are particularly applicable when dealing with unbounded PCollections.
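Purely as an illustration of what those windowing and trigger sections cover (not something from this answer itself), here is a hedged sketch in the Apache Beam Java SDK of windowing an unbounded PCollection before an aggregation. The five-minute fixed windows, the early-firing trigger, the allowed lateness, and the KV<String, TableRow> element type are all assumptions for the example.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingExample {

  // Counts events per key over five-minute windows of an unbounded PCollection.
  static PCollection<KV<String, Long>> windowedCounts(PCollection<KV<String, TableRow>> events) {
    return events
        .apply("FixedWindows",
            Window.<KV<String, TableRow>>into(FixedWindows.of(Duration.standardMinutes(5)))
                // Emit a speculative result one minute after the first element in a pane,
                // then a final result once the watermark passes the end of the window.
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))))
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes())
        .apply("CountPerKey", Count.perKey());
  }
}
```

Windowing and triggers like these are what let an aggregation over an unbounded PCollection produce results incrementally instead of waiting for input that never ends.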