Big Data Computing

Big Data phenomenon

  • Technological progress

    • storage capacity
    • communication bandwidth
    • computing power
    • Reduction of ICT costs
  • Digital Universe

    • Integration of digital technologies in every human activity
    • Scientific research (produces a lot of data)
    • Exponential growth of data
  • Data can be either structured (database records) or unstructured (textual data)

Application Domains

  • The analysis of large datasets arises in:
    • Retailing: product improvement, recommendation systems
    • Banking/Finance: fraud detection...
    • Telecommunications: user profiling
    • Science: validation methods
    • Medicine: diagnosis/therapy
    • Social studies: IoT

The Four V's of DATA

  1. Volume
    • size of data poses several computational challenges and requires a data-centric perspective
  2. Velocity
    • the data arrive at such a high rate that they cannot be stored and processed offline, and must instead be processed as a stream
  3. Variety
    • large datasets often come unstructured and may relate to very different scenarios
  4. Veracity
    • large datasets coming from real-world applications are likely to contain noisy, uncertain data
  • All points above require a paradigm shift with respect to traditional computing
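
As a rough illustration of the Velocity point, the sketch below processes items one at a time and keeps only constant-size state, instead of storing the whole dataset first. It is a minimal example and not course material: the function names `running_mean` and `sensor_readings`, and the sample values, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, using O(1) memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: no past item needs to be stored.
        mean += (x - mean) / count
        yield mean


def sensor_readings() -> Iterator[float]:
    """Hypothetical data source standing in for a high-rate stream."""
    yield from [2.0, 4.0, 6.0, 8.0]


if __name__ == "__main__":
    for current_mean in running_mean(sensor_readings()):
        print(current_mean)  # prints 2.0, 3.0, 4.0, 5.0 (one per line)
```

The state kept between items is just a counter and the current mean, so memory use does not grow with the length of the stream.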

Course presentation

Main objectives

  • Novel computing/programming frameworks for big data processing: theory and practice
    • Spark
  • A sample of key primitives for data analysis
    • Rigorous setting (be able to analytically predict what is going to happen)
    • Algorithmic solutions with focus on large inputs

Specific Content

  • Computational Frameworks: MapReduce, Apache Spark
  • Clustering primitives (Professor's focus)
  • Graph analysis primitives
  • Association analysis primitives (Data mining)
  • Data stream processing

Evaluation

  • Written exam (26 points)
  • Homeworks (6+1 points)
    • groups of max 3/4 students
    • 4 assignments, one every 2/3 weeks
    • Use of Apache Spark on individual PCs (assignments 1-3) and CloudVeneto (assignment 4)

Online tools
