Spark Streaming data cleaning mechanism
(I) DStream and RDD
Spark Streaming is built on Spark Core, and the core abstraction of Spark Core is the RDD, so Spark Streaming is necessarily tied to RDDs as well. However, Spark Streaming does not have users work with RDDs directly. Instead, it provides the DStream abstraction. A DStream contains RDDs: it represents a continuous sequence of them, one per batch interval. You can think of it in terms of the decorator pattern in Java: a DStream wraps and enhances RDDs while behaving much like one.
DStream and RDD have several things in common.
(1) Both have similar transformations, such as map and reduceByKey. DStream also has some of its own, such as window and mapWithState.
(2) Both have operations that trigger execution: actions on an RDD, such as count, and output operations on a DStream, such as foreachRDD.
As a result, the programming model is consistent between the two.
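The wrapper relationship can be sketched with a toy model in plain Python. This is not Spark's implementation: `ToyRDD` and `ToyDStream` are hypothetical stand-ins, and a real RDD is distributed and lazy where this one is just a list. The sketch only shows that a DStream-style `map` or `reduce_by_key` can delegate to the same-named RDD method once per batch.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: one immutable batch of records."""

    def __init__(self, records):
        self.records = list(records)

    def map(self, f):
        return ToyRDD(f(r) for r in self.records)

    def reduce_by_key(self, f):
        acc = {}
        for key, value in self.records:
            acc[key] = f(acc[key], value) if key in acc else value
        return ToyRDD(sorted(acc.items()))

    def collect(self):
        return list(self.records)


class ToyDStream:
    """Hypothetical stand-in for a DStream: one ToyRDD per batch time."""

    def __init__(self, rdd_for_batch):
        # rdd_for_batch: function from batch time to ToyRDD
        self.rdd_for_batch = rdd_for_batch

    def map(self, f):
        # Delegate to the RDD-level map for each batch.
        return ToyDStream(lambda t: self.rdd_for_batch(t).map(f))

    def reduce_by_key(self, f):
        # Delegate to the RDD-level reduce_by_key for each batch.
        return ToyDStream(lambda t: self.rdd_for_batch(t).reduce_by_key(f))

    def foreach_rdd(self, f, batch_times):
        # Output operation: the only place the per-batch RDD is handed out.
        for t in batch_times:
            f(self.rdd_for_batch(t), t)


batches = {1: ["a", "b", "a"], 2: ["b", "b"]}
words = ToyDStream(lambda t: ToyRDD(batches[t]))
counts = words.map(lambda w: (w, 1)).reduce_by_key(lambda x, y: x + y)

results = {}
counts.foreach_rdd(lambda rdd, t: results.update({t: rdd.collect()}), [1, 2])
print(results)  # {1: [('a', 2), ('b', 1)], 2: [('b', 2)]}
```

The chain written against `ToyDStream` reads the same as it would against a single `ToyRDD`, which is the sense in which the two programming models are consistent.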
(II) Introduction of DStream in Spark Streaming
DStream implementations fall into several categories.
(1) Input (data source) classes, such as InputDStream and its concrete subclasses, for example DirectKafkaInputDStream.
(2) Transformation classes, typically MappedDStream and ShuffledDStream.
(3) Output classes, typically ForEachDStream.
As the list shows, data is handled by the DStream system from beginning (input) to end (output). Users normally cannot create or manipulate the underlying RDDs directly, so DStream has both the opportunity and the obligation to manage the life cycle of those RDDs.
In other words, Spark Streaming cleans up RDDs automatically.
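To make the idea of automatic cleanup concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the class `CleaningDStream`, its methods, and the integer batch times are all hypothetical, and an "RDD" is just a list. It assumes a simple policy in which a stream caches one RDD per batch and drops any whose batch time is at or before the current time minus a remember duration.

```python
class CleaningDStream:
    """Hypothetical model: caches one 'RDD' per batch and forgets old ones."""

    def __init__(self, compute, remember_duration):
        self.compute = compute                      # batch time -> "RDD" (a list here)
        self.remember_duration = remember_duration  # how many time units to keep
        self.generated = {}                         # batch time -> cached "RDD"

    def get_or_compute(self, time):
        # Generate the batch's RDD once, then serve it from the cache.
        if time not in self.generated:
            self.generated[time] = self.compute(time)
        return self.generated[time]

    def clear_metadata(self, time):
        # Drop every cached RDD at or before the cutoff and report what was dropped.
        cutoff = time - self.remember_duration
        expired = [t for t in self.generated if t <= cutoff]
        for t in expired:
            del self.generated[t]
        return expired


stream = CleaningDStream(compute=lambda t: [t, t * 10], remember_duration=2)
for batch_time in range(1, 6):
    stream.get_or_compute(batch_time)
    stream.clear_metadata(batch_time)

remaining = sorted(stream.generated)
print(remaining)  # [4, 5]
```

Because the stream itself owns the cache, the user never has to release a batch explicitly; only the two most recent batches survive after five batches with a remember duration of 2.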
(III) The process of RDD generation in Spark Streaming
The life cycle of an RDD in Spark Streaming is roughly as follows.
(1) In an InputDStream, the received data is turned into an RDD. For example, DirectKafkaInputDStream generates a KafkaRDD.
(2) The data then passes through transformation classes such as MappedDStream, which simply call the corresponding RDD method, here map, to perform the conversion.
(3) Only in an output operation is the RDD exposed, so that the user can store it, run further computations on it, and so on.
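The three steps above can be sketched as a small pull-based pipeline in plain Python. The classes `ToyInputDStream`, `ToyMappedDStream`, and `ToyForEachDStream` are hypothetical and only echo the roles of the Spark classes named in the text; lists stand in for RDDs, and the method names are assumptions made for illustration.

```python
class ToyInputDStream:
    """Hypothetical source: turns received records into one batch 'RDD'."""

    def __init__(self, received):
        self.received = received  # batch time -> records received in that batch

    def compute(self, time):
        return list(self.received.get(time, []))


class ToyMappedDStream:
    """Hypothetical transformation: applies a map to the parent's 'RDD'."""

    def __init__(self, parent, f):
        self.parent = parent
        self.f = f

    def compute(self, time):
        parent_rdd = self.parent.compute(time)
        return [self.f(x) for x in parent_rdd]  # stands in for rdd.map(f)


class ToyForEachDStream:
    """Hypothetical output: the only place the user function sees the 'RDD'."""

    def __init__(self, parent, foreach_func):
        self.parent = parent
        self.foreach_func = foreach_func

    def generate_job(self, time):
        # Pull the batch through the parent chain, then wrap the user call as a job.
        rdd = self.parent.compute(time)
        return lambda: self.foreach_func(rdd, time)


received = {1: ["a", "b"], 2: ["c"]}
sink = []
source = ToyInputDStream(received)
upper = ToyMappedDStream(source, str.upper)
output = ToyForEachDStream(upper, lambda rdd, t: sink.append((t, rdd)))

for batch_time in (1, 2):
    job = output.generate_job(batch_time)
    job()

print(sink)  # [(1, ['A', 'B']), (2, ['C'])]
```

In this sketch the output stage drives everything: each batch is requested from the end of the chain, and the user-supplied function is the single point at which a batch leaves the DStream system.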