Lightweight streaming to your data warehouse
Many of my data / GCP friends have been posting about the recently launched Pub/Sub -> BigQuery direct streaming. You might wonder what this is good for and why it's cool - well, let me tell you 🙂
- There are plenty of times when you're ingesting data that has already been cleaned by some upstream process, and you want to land it in BQ tables for auditing or other use cases, such as training data for your ML models 🧠
- Previously, the best (and scalable) way to do this was to write your own little Dataflow job that streamed the data to BQ (see the Beam sketch after this list) 🧑💻
- If you found yourself doing this one too many times, you'd likely then write a Flex Template that you could easily deploy and scale as your event types grew 💪
- Flex Templates are awesome and you could just stop there, but every template deployment means a little more infra to manage (and yes, even though Dataflow jobs are serverless, you still need to apply your SRE best practices to these pipelines!)
- There's also the overhead of running the Dataflow jobs themselves. Each job is simple in nature, but as event types grow, the number of template deployments grows, and so does the number of jobs you and the team need to keep an eye on. All of this is expensive! (true cost ~= infra cost + engineering effort) 💸
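To make the contrast concrete, here's a minimal sketch of what one of those little streaming jobs tends to look like in Apache Beam (Python). The project, subscription, bucket, and table names are all hypothetical placeholders:

```python
# A minimal sketch of the "old way": a tiny Beam/Dataflow job that streams
# JSON events from a Pub/Sub subscription into a BigQuery table.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,  # Pub/Sub sources require a streaming pipeline
    runner="DataflowRunner",
    project="my-project",            # placeholder
    region="us-central1",            # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

Not much code, but it's still a job to deploy, monitor, and pay for - per event type.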
Direct streaming solves ALL of this. Not only does it remove the need for the intermediate Dataflow job, it also significantly reduces your true cost of operations (see the sketch below). The result: happy engineers, a more reliable product, and significantly simpler architectures. BTW, your data can still be conformed to a schema with Pub/Sub's schema feature (more on this in another post =))
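And here's a rough sketch of the "new way" for comparison, assuming the google-cloud-pubsub client library: a single BigQuery subscription, no pipeline code at all. Again, the resource names are placeholders:

```python
# A rough sketch of the "new way": a Pub/Sub subscription that writes each
# message straight into a BigQuery table - no intermediate Dataflow job.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

subscription = subscriber.create_subscription(
    request={
        "name": "projects/my-project/subscriptions/events-to-bq",  # placeholder
        "topic": "projects/my-project/topics/events",              # placeholder
        "bigquery_config": {
            # Target table in {project}.{dataset}.{table} form
            "table": "my-project.my_dataset.events",
            # Map fields from the topic's schema onto table columns
            "use_topic_schema": True,
        },
    }
)
print(f"Created BigQuery subscription: {subscription.name}")
```

One setup note: the Pub/Sub service account needs write access to the target table, so grant it the appropriate BigQuery role on the dataset. After that, the "pipeline" is just a property of the subscription itself.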
Learn more about this here: https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics