Added to_kafka directly from a Dask worker #279
jsmaupin wants to merge 1 commit into python-streamz:master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #279 +/- ##
==========================================
- Coverage 94.69% 93.61% -1.08%
==========================================
Files 13 13
Lines 1620 1644 +24
==========================================
+ Hits 1534 1539 +5
- Misses 86 105 +19
Continue to review full report at Codecov.
I'll add tests to get coverage now.
@jsmaupin, did you get a chance to write some tests? @martindurant, @CJ-Wright, any thoughts on this?
@chinmaychandak I suspect this can be done with the existing to_kafka method. We just need to figure out the difference between this implementation and the existing one and add an …
Okay, let me take a look.

@jsmaupin … Hence, I am thinking that this current implementation of yours would be the best way out?
```python
client = default_client()
result = client.submit(produce, self.topic, x, self.producer_config)
self._emit(result)
```
Should this call self.emit? That way it integrates with the async support?

Agreed. Anything else that comes to mind that needs change, @CJ-Wright?
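For reference, here is a minimal sketch of what the worker-side produce helper submitted in the snippet above might look like. Only the name and call signature come from the diff; the body is an assumption, and it presumes confluent_kafka is installed on every worker.

```python
# Illustrative sketch only: the Producer is constructed inside the function,
# i.e. on the Dask worker, so it never has to be pickled on the client side.
from confluent_kafka import Producer


def produce(topic, value, producer_config):
    producer = Producer(producer_config)
    producer.produce(topic, value)
    producer.flush()  # block until delivery, so the Dask future resolving means "done"
    return value
```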
@martindurant Could you also please take a look at this?
Circling back to this. After becoming more familiar with everything here, I'm thinking that this should support back pressure like the existing to_kafka sink does.
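A rough sketch of how that back pressure could work, assuming streamz's convention that a node's update() can return an awaitable that upstream nodes wait on. The class name is hypothetical, and produce is the worker-side helper sketched earlier; this is not the actual streamz implementation.

```python
from distributed.client import default_client
from streamz import Stream
from tornado import gen


class to_kafka_on_worker(Stream):  # hypothetical name, not an existing streamz node
    def __init__(self, upstream, topic, producer_config, **kwargs):
        self.topic = topic
        self.producer_config = producer_config
        super().__init__(upstream, **kwargs)

    @gen.coroutine
    def update(self, x, who=None, metadata=None):
        client = default_client()
        # `produce` is the worker-side helper sketched earlier in this thread.
        future = client.submit(produce, self.topic, x, self.producer_config)
        # Wait for the worker-side write to finish before emitting downstream;
        # upstream nodes that wait on this coroutine slow down accordingly,
        # which is the back-pressure behaviour being asked for here.
        result = yield client.gather(future, asynchronous=True)
        yield self._emit(result)
```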
This PR got left behind. Does anyone remember the status here?
I proposed a solution here: https://stackoverflow.com/questions/60764361/write-to-kafka-from-a-dask-worker that I felt was a bit of a hack. I haven't found a better solution. I would be happy to follow through with this implementation if there are no objections.
I think that approach is totally fine, but maybe I would improve it by having a dict of producers, with the key being a hash of the connection kwargs, because you could have different Kafka jobs live in a cluster at the same time. Also, the attribute could just as easily be a global variable in a module, especially if it's mutable like the dict I'm suggesting above. This seems cleaner to my mind, but I can't think of any technical reason that it's different (there is only one worker instance).
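A quick sketch of that idea, with illustrative names: a module-level dict caches one Producer per distinct connection configuration on each worker process.

```python
import json

from confluent_kafka import Producer

# Module-level cache: one Producer per distinct config, per worker process.
_producers = {}


def get_producer(producer_config):
    # Key on a stable serialization of the connection kwargs (a hash of this
    # string would work equally well), so different Kafka jobs sharing the
    # cluster each get their own Producer.
    key = json.dumps(producer_config, sort_keys=True)
    if key not in _producers:
        _producers[key] = Producer(producer_config)
    return _producers[key]


def produce(topic, value, producer_config):
    producer = get_producer(producer_config)
    producer.produce(topic, value)
    producer.poll(0)  # serve delivery callbacks without blocking
    return value
```

Reusing a cached Producer also lets librdkafka batch messages instead of paying a connection and flush cost on every record.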
Currently, we must gather() all the results from a Dask stream back to the master script and then push the results to Kafka. This removes all the benefits of parallel processing we get with Dask and Kafka. It would be much more efficient if we could push data directly from the Dask workers into Kafka.

One issue I had getting this to work is that the Producer class from the Confluent Kafka Python library is not pickle-able. The workaround is to hide the Producer from the pickle function by creating it using "reflection" methods on the worker side. However, I believe this adds a requirement that the confluent_kafka library must be installed on the worker.
Also, this implementation is serial, but Dask itself is parallel.
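For contrast, a rough illustration of the gather-based pattern described above. The topology and names are illustrative, and it assumes a running Dask client plus streamz's scatter/gather nodes.

```python
from dask.distributed import Client
from streamz import Stream
import streamz.dask  # registers the scatter/gather Dask nodes (may already happen on plain import)
from confluent_kafka import Producer

client = Client()  # assumes a Dask scheduler and workers are reachable
producer = Producer({"bootstrap.servers": "localhost:9092"})  # illustrative config


def process(x):
    return str(x).encode()  # placeholder for the real per-record computation


source = Stream()
(source.scatter()      # fan records out to the Dask workers
       .map(process)   # the computation runs in parallel on the workers
       .gather()       # ...but every result is pulled back into this process
       .sink(lambda value: producer.produce("results", value)))  # serial push to Kafka

for i in range(10):
    source.emit(i)
producer.flush()
```

Everything after gather() runs in the single client process, which is the serial bottleneck described above; the approach in this PR instead submits the Kafka write itself to the workers.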