
Lesson 16: Demystifying Data Cleanup Internals in the Spark Streaming Source Code

Source: 程序員人生 | Published: 2016-06-15

The main goal of this post is to:
1. Clarify the data cleanup flow in Spark Streaming

The post is organized as follows:
a) Background
b) How to study data cleanup in Spark Streaming
c) Source code walkthrough

1: Background
Data cleanup in Spark Streaming is something you face both in real development and in hands-on experiments. As Batch Durations go by, Spark Streaming keeps producing RDDs, so in-memory objects are generated continuously, including both metadata and the data itself. For this reason Spark Streaming has its own mechanism for cleaning up the metadata and data it produces.

2: How to study data cleanup in Spark Streaming

  1. Operating on a DStream produces metadata, so to understand RDD cleanup we must start from DStream, because DStream is the template for RDDs and DStreams depend on one another.
     DStream operations produce the RDDs, and receiving data also goes through DStream; data input, computation, and output, the whole lifecycle, are built on top of DStream. DStream therefore owns the full lifecycle of its RDDs, and it is the entry point for this study.
  2. Take a Kafka source accessed through the Direct approach. As time goes on, the DStream maintains a HashMap inside its own in-memory data structures; the HashMap is keyed by batch time and maps each time window to its RDD. RDDs are stored and removed per Batch Duration (a minimal sketch of such an application follows this list).
  3. Spark Streaming keeps running and keeps producing RDDs as it computes, e.g. one per Batch Duration (say, every second); on top of that there may be accumulators and broadcast variables. Because these objects are produced continuously, Spark Streaming has its own cleanup mechanism for objects, metadata, and data.
  4. Spark Streaming's management of RDDs is, in effect, what GC is to the JVM.
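To make the objects being cleaned up concrete, here is a minimal sketch of such a Direct Kafka application (based on the Spark 1.x spark-streaming-kafka API; the broker address, topic name, and app name are placeholder assumptions, not taken from the course). Every one-second batch adds a Time -> RDD entry to the DStream's generatedRDDs map, which is exactly what the cleanup path below removes again.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaCleanupDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaCleanupDemo").setMaster("local[2]")
    // One batch, and therefore one new entry in generatedRDDs, every second.
    val ssc = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker address
    val topics = Set("demo-topic")                                    // assumed topic name

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // A trivial output operation so that jobs are generated for every batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}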

3: Source code walkthrough
generatedRDDs: RDDs are stored and removed on a per-Batch-Duration basis.

// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

In real development we may cache RDDs manually; even if we do not, objects are still held in the in-memory generatedRDDs map, so how do we release them? It is not only the RDDs themselves: the data they were built from (the data source) and the metadata must be released as well, so all three aspects have to be considered.
Release is tied to the clock tick: since data is produced periodically, it must also be released periodically.
The next step, therefore, is to look at JobGenerator.

1. RecurringTimer: the timer keeps posting messages to the EventLoop.
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
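RecurringTimer is Spark's own scheduling utility; conceptually it is a fixed-rate timer that fires the given callback once per batchDuration, handing it the trigger time. A rough stand-alone analogue built on plain JDK scheduling (not the Spark class itself, and assuming a 1-second batch) looks like this:

import java.util.concurrent.{Executors, TimeUnit}

object TimerAnalogue {
  def main(args: Array[String]): Unit = {
    val batchDurationMs = 1000L // assumed 1-second batch interval
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        // In Spark this is the point where eventLoop.post(GenerateJobs(new Time(longTime))) happens.
        println(s"tick: a GenerateJobs event would be posted for time ${System.currentTimeMillis()}")
      }
    }, 0L, batchDurationMs, TimeUnit.MILLISECONDS)
  }
}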
2. eventLoop: onReceive receives the messages.
/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
3. processEvent: this is where ClearMetadata and ClearCheckpointData are received.
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
4. clearMetadata: clears the metadata.
/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)

  // If checkpointing is enabled, then checkpoint,
  // else mark batch to be fully processed
  if (shouldCheckpoint) {
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    // If checkpointing is not enabled, then delete metadata information about
    // received blocks (block data not saved in any case). Otherwise, wait for
    // checkpointing of this batch to complete.
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}
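Whether clearMetadata takes the checkpointing branch depends on the application having enabled checkpointing. A minimal, hedged sketch of that from the application side (app name and directory are placeholder assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointDemo").setMaster("local[2]") // placeholder names
val ssc = new StreamingContext(conf, Seconds(1))

// Setting a checkpoint directory is what normally makes shouldCheckpoint true (together with the
// checkpoint interval, which defaults to the batch interval), so clearMetadata posts
// DoCheckpoint(time, clearCheckpointDataLater = true) instead of cleaning up block metadata right away.
ssc.checkpoint("/tmp/streaming-checkpoints") // placeholder path; use a reliable FS such as HDFS in production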
5. DStreamGraph: it first cleans up the output DStreams, which are in fact ForEachDStreams.
def clearMetadata(time: Time) {
  logDebug("Clearing metadata for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearMetadata(time))
  }
  logDebug("Cleared old metadata for time " + time)
}
6. DStream.clearMetadata: besides removing the RDDs, it also clears their metadata. If you need RDDs to survive across Batch Durations you can set rememberDuration, which is usually a multiple of the Batch Duration. For example, with a 1-second batch and a rememberDuration of 3 seconds, cleaning up the batch at time 10s removes every RDD generated at or before 7s. (A sketch of how an application sets these knobs follows the code below.)
/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
  val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
  // Use rememberDuration (the remember window) to decide which RDDs count as old.
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  logDebug("Clearing references to old RDDs: [" +
    oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
  // Drop the old keys from generatedRDDs.
  generatedRDDs --= oldRDDs.keys
  if (unpersistData) {
    logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
    oldRDDs.values.foreach { rdd =>
      rdd.unpersist(false)
      // Explicitly remove blocks of BlockRDD
      rdd match {
        case b: BlockRDD[_] =>
          logInfo("Removing blocks of RDD " + b + " of time " + time)
          b.removeBlocks() // remove the RDD's underlying data blocks
        case _ =>
      }
    }
  }
  logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " +
    (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", "))
  // The DStreams this one depends on must be cleaned up as well.
  dependencies.foreach(_.clearMetadata(time))
}
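On the application side, the two knobs that feed into the code above are the spark.streaming.unpersist configuration and the remember window. A minimal hedged sketch (app name and durations are placeholder assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("RememberWindowDemo")          // placeholder app name
  .setMaster("local[2]")
  .set("spark.streaming.unpersist", "true")  // default: allow Spark Streaming to unpersist old RDDs

val ssc = new StreamingContext(conf, Seconds(1))
// Ask every DStream to remember its generated RDDs for at least 60 seconds, so they stay in
// generatedRDDs across batches and are only dropped once they are older than that window.
ssc.remember(Seconds(60))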
7. In BlockRDD, the BlockManagerMaster removes each block by its blockId. Removing a block is irreversible.
/**
 * Remove the data blocks that this BlockRDD is made from. NOTE: This is an
 * irreversible operation, as the data in the blocks cannot be recovered back
 * once removed. Use it with caution.
 */
private[spark] def removeBlocks() {
  blockIds.foreach { blockId =>
    sparkContext.env.blockManager.master.removeBlock(blockId)
  }
  _isValid = false
}

Back to processEvent in JobGenerator above:
1. clearCheckpointData (in JobGenerator): clears the checkpoint data.

/** Clear DStream checkpoint data for the given `time`. */
private def clearCheckpointData(time: Time) {
  ssc.graph.clearCheckpointData(time)

  // All the checkpoint information about which batches have been processed, etc have
  // been saved to checkpoints, so its safe to delete block metadata and data WAL files
  val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
  jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
  jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
  markBatchFullyProcessed(time)
}
2. DStreamGraph.clearCheckpointData: forwards the cleanup to every output stream.
def clearCheckpointData(time: Time) {
  logInfo("Clearing checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearCheckpointData(time))
  }
  logInfo("Cleared checkpoint data for time " + time)
}
3. DStream.clearCheckpointData: just like the metadata cleanup, it also clears the checkpoint data of the DStreams it depends on.
private[streaming] def clearCheckpointData(time: Time) {
  logDebug("Clearing checkpoint data")
  checkpointData.cleanup(time)
  dependencies.foreach(_.clearCheckpointData(time))
  logDebug("Cleared checkpoint data")
}
4. DStreamCheckpointData: deletes the old checkpoint files.
/**
 * Cleanup old checkpoint data. This gets called after a checkpoint of `time` has been
 * written to the checkpoint directory.
 */
def cleanup(time: Time) {
  // Get the time of the oldest checkpointed RDD that was written as part of the
  // checkpoint of `time`
  timeToOldestCheckpointFileTime.remove(time) match {
    case Some(lastCheckpointFileTime) =>
      // Find all the checkpointed RDDs (i.e. files) that are older than `lastCheckpointFileTime`
      // This is because checkpointed RDDs older than this are not going to be needed
      // even after master fails, as the checkpoint data of `time` does not refer to those files
      val filesToDelete = timeToCheckpointFile.filter(_._1 < lastCheckpointFileTime)
      logDebug("Files to delete:\n" + filesToDelete.mkString(","))
      filesToDelete.foreach {
        case (time, file) =>
          try {
            val path = new Path(file)
            if (fileSystem == null) {
              fileSystem = path.getFileSystem(dstream.ssc.sparkContext.hadoopConfiguration)
            }
            fileSystem.delete(path, true)
            timeToCheckpointFile -= time
            logInfo("Deleted checkpoint file '" + file + "' for time " + time)
          } catch {
            case e: Exception =>
              logWarning("Error deleting old checkpoint file '" + file + "' for time " + time, e)
              fileSystem = null
          }
      }
    case None => logDebug("Nothing to delete")
  }
}

At this point we know how the cleanup proceeds. The full flow (the original post illustrated it with a diagram): JobGenerator posts ClearMetadata / ClearCheckpointData, DStreamGraph forwards the call to every output DStream, each DStream cleans its generatedRDDs and its dependencies, and finally the BlockManager blocks and old checkpoint files are removed.

But when is the cleanup actually triggered?
1. When a job is finally submitted, it is handed to a JobHandler for execution.

private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          // When the job has finished, post JobCompleted; onReceive will pick it up.
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}
2. onReceive in JobScheduler receives the JobCompleted message.
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()
3.  processEvent:
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable => reportError("Error in job scheduler", e)
  }
}
4. handleJobCompletion calls JobGenerator's onBatchCompletion method to clear the metadata (see the sketch after the code below).
private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  job.result match {
    case Failure(e) => reportError("Error running job " + job, e)
    case _ =>
  }
}
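To close the loop: in the JobGenerator source this walkthrough appears to be based on (around Spark 1.6), onBatchCompletion simply posts ClearMetadata, which lands back in the processEvent / clearMetadata path analyzed in section 3 (quoted from memory, so verify against your Spark version):

/** Callback called when a batch has been completely processed. */
def onBatchCompletion(time: Time) {
  eventLoop.post(ClearMetadata(time))
}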

The trigger flow (again shown as a diagram in the original post): JobHandler posts JobCompleted, JobScheduler.handleJobCompletion is invoked, and once the whole JobSet of a batch has completed it calls JobGenerator.onBatchCompletion, which kicks off the cleanup described above.

