
How to save the latest offset Spark consumed to ZK or Kafka and read it back after restart

I am using Kafka 0.8.2 to receive data from AdExchange, and Spark Streaming 1.4.1 to store that data in MongoDB.

My problem is that when I restart my Spark Streaming job (for instance to deploy a new version, fix a bug, or add a feature), it resumes reading from the latest Kafka offset at that moment, so I lose the data AdX pushed to Kafka while the job was down.

I tried something like auto.offset.reset -> smallest, but then it consumes everything from offset 0 to the end; the volume is huge and the data ends up duplicated in the database.

I also tried setting a specific group.id and consumer.id for Spark, but the result is the same.

How can I save the latest offset Spark consumed to ZooKeeper or Kafka, and read it back on restart so consumption resumes from that offset?
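The pattern being asked for, save the consumed offset after each batch and read it back at startup, can be sketched as follows. This is a minimal simulation, not Spark code: a plain dict stands in for ZooKeeper, a list of (offset, payload) tuples stands in for a Kafka partition, and the names (ZK_PATH, read_offsets, commit_offsets, run_job) are illustrative, not part of any real API. In a real job, the loaded offset would be passed as fromOffsets to the Kafka direct stream.

```python
ZK_PATH = "/consumers/adx-job/offsets/topic/0"  # hypothetical ZK node path

def read_offsets(zk_store):
    """On startup, load the last committed offset; None means 'no state yet'."""
    return zk_store.get(ZK_PATH)

def commit_offsets(zk_store, next_offset):
    """After a batch is safely stored, record the next offset to read."""
    zk_store[ZK_PATH] = next_offset

def run_job(partition, zk_store, sink):
    """Consume from the saved offset onward, store results, then commit."""
    start = read_offsets(zk_store) or 0
    batch = partition[start:]          # role of fromOffsets in the direct API
    sink.extend(payload for _, payload in batch)
    commit_offsets(zk_store, start + len(batch))

# Simulated restart: the second run resumes where the first one committed,
# so nothing pushed during the downtime is lost.
partition = [(i, f"msg-{i}") for i in range(5)]
zk, sink = {}, []
run_job(partition[:3], zk, sink)  # job runs, then is stopped for an upgrade
run_job(partition, zk, sink)      # restart: msgs 3-4 arrived meanwhile; resume at 3
print(sink)  # all five messages, each exactly once
```

Note that the commit happens only after the batch reaches the sink; if the job dies between the write and the commit, the batch is replayed, which is why the answers below stress idempotent output.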




To add to Michael Kopaniov's answer: if you really want to use ZK as the place to store and load your offset map from, you can.

However, because your results are not being output to ZK, you will not get reliable semantics unless your output operation is idempotent (which it sounds like it isn't).
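Making the output idempotent means a replayed batch is harmless: key each write by a stable record id (for example topic/partition/offset) and upsert instead of appending. A minimal sketch, with a dict standing in for a MongoDB collection and upsert_record as an illustrative name:

```python
def upsert_record(collection, record_id, doc):
    # Equivalent in spirit to a Mongo update with upsert=True: a second
    # write with the same _id overwrites rather than inserting a duplicate.
    collection[record_id] = doc

collection = {}
batch = [("topic-0-41", {"ad": "a"}), ("topic-0-42", {"ad": "b"})]
for rid, doc in batch:
    upsert_record(collection, rid, doc)
for rid, doc in batch:              # the same batch replayed after a restart
    upsert_record(collection, rid, doc)
print(len(collection))  # still 2: the replay changed nothing
```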

If it's possible to store your results in the same document in mongo alongside the offsets in a single atomic action, that might be better for you.
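The idea above, results and offsets written together in one atomic step so they can never disagree, can be sketched like this. A single dict assignment stands in for a single-document Mongo write; the field names and helper names are illustrative:

```python
store = {}

def commit_batch(store, batch_id, results, next_offset):
    # One assignment models one atomic single-document write: either both
    # the results and the offset land, or neither does.
    store[batch_id] = {"results": results, "next_offset": next_offset}

def last_offset(store):
    # On restart, recover the resume point from the stored documents.
    return max((d["next_offset"] for d in store.values()), default=0)

commit_batch(store, "batch-1", ["msg-0", "msg-1"], next_offset=2)
commit_batch(store, "batch-2", ["msg-2"], next_offset=3)
print(last_offset(store))  # resume from offset 3 after a restart
```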

For more detail, see https://www.youtube.com/watch?v=fXnNEq1v3VA
