Section outline

  • These lectures provide the theoretical and practical bases for storing and effectively processing large volumes of data: collecting, retrieving, accessing Big Data. 

    We will first study how to analyze, organize and present Big Data in order to address its specific challenges: reducing complexity, processing the data deluge in real time, and proposing new paradigms for extracting relevant knowledge. The course will then introduce state-of-the-art Big Data computing platforms, with a focus on how to use them to process (manage and analyze) massive datasets. Specifically, we will discuss the Apache Hadoop MapReduce and Apache Spark frameworks, which provide the most accessible and practical means of computing with large datasets in the Cloud. 

    • Big Data overview (definition, characteristics)
    • Cloud storage models (Binary Large Objects: Amazon S3, Azure Blobs), NoSQL (Google BigTable, Cassandra), disk storage (GoogleFS, HDFS, PVFS, Lustre), in-memory storage (key-value stores, hybrid systems: memcached, MongoDB)
    • Batch vs. stream processing
    • Consistency models
    • Big Data processing models: MapReduce
    • Big Data platforms: Apache Hadoop, Apache Spark
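    To give a flavor of the MapReduce model listed above, here is a minimal single-machine sketch in plain Python (no Hadoop or Spark required): a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums the counts. The function names and the toy documents are illustrative, not part of any framework API.

    ```python
    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # "Map" step: emit a (word, 1) pair for each word in the document.
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        # Group values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # "Reduce" step: aggregate all counts for one word.
        return key, sum(values)

    docs = ["big data big ideas", "data deluge"]
    mapped = chain.from_iterable(map_phase(d) for d in docs)
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    # counts == {"big": 2, "data": 2, "ideas": 1, "deluge": 1}
    ```

    Real platforms such as Hadoop and Spark follow the same pattern but partition the map and reduce work across a cluster and handle the shuffle, fault tolerance and data locality for you.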

    Lecturer: Alexandru Costan alexandru.costan@insa-rennes.fr