
Hello,
Interesting. What receives the flows, and where do you keep them, at the other end of the messaging bus?
PS: in my case I am talking about hundreds of kiloflows/s that I would like to keep for at least a few weeks, so MemSQL or any other SQL database is out of the picture.
Thank you
I've seen a lot of different approaches from people trying to build their own at that scale (taking flow off a bus and storing it for medium- to long-term analysis), so I'll share some of what I've seen (not specific to vFlow).

MemSQL, as shown, is one option, and it's super fast, even multi-tenant, for the in-RAM row store. It has a to-disk column store as well, though that is less optimized for massively indexed retrieval; still, it's worth noting that MemSQL is not only an in-RAM solution, and because it does batched inserts from the row store to the column store, it can keep up with pretty high ingest rates into the diskful column store. Another option in the "native" SQL-y space is CitusDB, though high ingest rates were an issue last I looked, and it didn't have multi-tenancy/rate-limiting support, so any one monster query could slow everything down. Both MemSQL and Citus are commercial, though a lot of Citus functionality is OSS. For just forensics (vs. fast ad hoc querying for operational or BI purposes) they can be a good augment, but they are well behind on performance vs. at least one commercial solution, especially for multi-tenant use (reports, peering analysis, spelunking via a portal, alerting, DDoS detection, etc. all going on at once).

There are plenty of Hadoop-ecosystem column stores as well that can take flow directly from Kafka, or with light translation: Presto, Impala, Drill, and others. Most of them can do multi-column indexing and support SQL as an interface, but multi-tenancy support is also lacking, and if you don't get the indexes right, many kinds of queries can take minutes to hours over months of data (even from relatively few routers). But they can all do multi-hundred-k FPS from Kafka. You'll also need to run a Hadoop cluster, and there are HDFS-topped column store implementations running at pretty large scale. Spark I've never seen people stick with: it can compute real-time aggregates with streaming, and if you try to store from RAM to disk it's less badly slow than Hadoop for map/reduce patterns, but it's slower than just about every column store for accessing trillions of records and doing specific sub-selections to query or dynamically aggregate. ClickHouse from Yandex is interesting, but for flow, people generally get hung up on its single column for indexing. It can scan VERY fast, though, which still makes it best suited to 100%-forensics use cases at the data scale you're asking about.

The leading DIY option we see for store-everything is actually the Elastic stack. There are still issues with security (everyone who can access the Elastic backend can access all of the data), and it can require a tremendous # of machines to keep it fast - easily tens of machines for hundreds of k FPS over months. But it's doable and can be pretty fast, if a bit less network-savvy. There's some support for storing prefixes now, but it still lacks some network savviness (projecting across AS paths, multi-hop lookups to find the ultimate exit, flexibility in variable-prefix-length querying), and you need to front-end it with something like pmacct to do fusion, and then build that into an HA architecture if it's really important. But there are a number of DIY setups we've seen that are Elastic-based - and more than that are Hadoop/SQL-based.
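As a side note, since the ingest side looks about the same for all of these: the usual DIY pattern is to pull vFlow's JSON off Kafka, batch by count and by time, and bulk-load into whatever backend you chose. Here's a rough Go sketch of that loop using segmentio/kafka-go. The topic name, the JSON field names, and writeBatch are placeholders you'd adapt to your vFlow config and your store's bulk API, so treat it as an illustration of the batching pattern rather than a drop-in loader:

    package main

    import (
        "context"
        "encoding/json"
        "log"
        "time"

        kafka "github.com/segmentio/kafka-go"
    )

    // flow holds just a few fields from vFlow's JSON output; the field names
    // here are illustrative - adjust them to the actual message schema you see.
    type flow struct {
        SrcIP   string `json:"src_ip"`
        DstIP   string `json:"dst_ip"`
        Bytes   uint64 `json:"bytes"`
        Packets uint64 `json:"packets"`
    }

    // writeBatch is a placeholder for whatever bulk-insert your backend exposes
    // (a COPY/LOAD path for a column store, _bulk for Elastic, etc.).
    func writeBatch(batch []flow) error {
        log.Printf("flushing %d flows", len(batch))
        return nil
    }

    func main() {
        r := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"localhost:9092"},
            Topic:   "vflow.netflow9", // assumption: set to whatever topic your vFlow writes
            GroupID: "flow-loader",
        })
        defer r.Close()

        const maxBatch = 50000              // flush when the batch hits this size...
        const flushEvery = 2 * time.Second  // ...or when the stream goes quiet this long

        batch := make([]flow, 0, maxBatch)
        flush := func() {
            if len(batch) == 0 {
                return
            }
            if err := writeBatch(batch); err != nil {
                log.Printf("flush failed: %v", err) // real code needs retries/dead-lettering
            }
            batch = batch[:0]
        }

        for {
            ctx, cancel := context.WithTimeout(context.Background(), flushEvery)
            m, err := r.ReadMessage(ctx)
            cancel()
            if err != nil {
                flush() // timeout or broker hiccup: push out what we have and carry on
                continue
            }
            var f flow
            if err := json.Unmarshal(m.Value, &f); err != nil {
                continue // skip messages we can't decode
            }
            batch = append(batch, f)
            if len(batch) >= maxBatch {
                flush()
            }
        }
    }

At hundreds of k FPS you'd run several of these in one consumer group (one per Kafka partition) and make the flush asynchronous so a slow bulk-insert doesn't stall the reads; the batch/flush shape stays the same regardless of backend.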
And then, the biggest flow store I know of (one or two carriers may want to argue, but I haven't seen theirs) is at DISA for DoD: more than a decade of un-sampled flow coming in via SiLK. It's all stored in hourly un-indexed files, with essentially nothing but a CLI to access it, and it's cluster-able with some work (there is a non-OSS add-on to do it). But it works and is pretty neat in its own way, again optimized around a forensics-only set of queries (vs. also doing operations, BGP, peering, and cost analytics and optimization). It can certainly ingest at more than the scale you're talking about and is pretty efficient at storing it on disk. And if you ran it on top of a big MapR-ish NFS cluster (no flames please, though I'm not completely joking), you could effectively cluster it. It will still be pretty slow for anything but time-bounded forensic queries, though.

And then (separate topic, and an equally long potential survey) there is a new wave of streaming databases that can consume directly from Kafka. If you don't mind having to pre-define your queries, or are using one to augment a column store, they can be MUCH more lightweight than any of the above options, though they are also lacking in some networking primitives. And if you're running on sampled flow already, the extra loss of precision might not be an issue (they pretty much all use probabilistic data structures like HLLs to do counts and topN). MemSQL can operate in that mode as well, though I don't think that was how Mehrdad was showing it working with vFlow. But again, you can't ever go 'back in time' for an ad hoc query with them, so they're probably more interesting as an augment and offloader for most uses where you'd normally think of storing many billions or a few trillion flows.

Happy flow-ing...

Avi Freedman
CEO, Kentik
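PS: since HLLs came up - if the probabilistic approach is unfamiliar, the count side is simple enough to sketch from scratch. Below is a minimal HyperLogLog in Go, just to show the mechanics; real systems use tuned, bias-corrected implementations, and topN is usually done with related sketches (count-min, space-saving) rather than HLL itself:

    package main

    import (
        "fmt"
        "hash/fnv"
        "math"
        "math/bits"
    )

    // hll is a bare-bones HyperLogLog: 2^p one-byte registers, each holding the
    // maximum "rank" (position of the first 1-bit) seen among hashes routed to it.
    type hll struct {
        p    uint
        regs []uint8
    }

    func newHLL(p uint) *hll {
        return &hll{p: p, regs: make([]uint8, 1<<p)}
    }

    func (h *hll) add(item string) {
        hf := fnv.New64a()
        hf.Write([]byte(item))
        x := hf.Sum64()

        idx := x >> (64 - h.p)                         // top p bits pick the register
        rank := uint8(bits.LeadingZeros64(x<<h.p)) + 1 // first 1-bit in the remaining bits
        if maxRank := uint8(64 - h.p + 1); rank > maxRank {
            rank = maxRank
        }
        if rank > h.regs[idx] {
            h.regs[idx] = rank
        }
    }

    func (h *hll) estimate() float64 {
        m := float64(len(h.regs))
        sum, zeros := 0.0, 0
        for _, r := range h.regs {
            sum += math.Exp2(-float64(r))
            if r == 0 {
                zeros++
            }
        }
        alpha := 0.7213 / (1 + 1.079/m) // bias constant, valid for m >= 128
        est := alpha * m * m / sum
        if est <= 2.5*m && zeros > 0 { // small-range correction: fall back to linear counting
            est = m * math.Log(m/float64(zeros))
        }
        return est
    }

    func main() {
        // e.g. "distinct source IPs seen in this window" over a flow stream
        sk := newHLL(14) // 16384 registers = 16 KB, standard error around 0.8%
        for i := 0; i < 200000; i++ {
            sk.add(fmt.Sprintf("10.%d.%d.%d", (i>>16)&255, (i>>8)&255, i&255))
        }
        fmt.Printf("estimated distinct sources: %.0f\n", sk.estimate())
    }

The point being: ~16 KB of registers gets you distinct-count estimates within roughly 1%, which is why the streaming engines can be so much lighter-weight than storing every flow.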