Published on May 20, 2024
Real-Time Data Pipeline with Apache Kafka, Spark, and Hive
Big-Data, Data-Science, Cloud
This article outlines a scalable, Docker-based architecture for handling data streams from Reddit, processing them with Apache Kafka and Spark, and storing the results in Apache Hive for analytical querying.
Published on January 20, 2024
HDFS; Top-3 IPs for each hour of IP stream
Big-Data, Data-Science
A MapReduce process, with emphasis on each mapper and reducer step, environment configuration, and the generation of intermediate results.