DATA420-24S2 (C) Semester Two 2024

Scalable Data Science

15 points

Details:
Start Date: Monday, 15 July 2024
End Date: Sunday, 10 November 2024
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Sunday, 28 July 2024
  • Without academic penalty (no fee refund): Sunday, 29 September 2024

Description

This course introduces students to core topics in scalable data science based on distributed computing techniques. We will look at principles of distributed computing, topics in statistical modelling, and applications of distributed machine learning to find scalable solutions for real problems. This is a very practical course: students learn by experimenting with large datasets on the university data science cluster. All computing resources are available online via remote desktop, and interactive computer labs are run online for students who are not on campus. Enrolled students have ongoing access to the data science cluster to pursue additional projects.

The intent of the course is to provide an environment that is very similar to what you will experience in a data science position in industry. You will need to understand the theory underlying common solutions to data science problems and how to implement these using a distributed computing framework such as Spark.

Learning Outcomes

Concrete learning outcomes will include the following:

  • Demonstrate knowledge of the need and use cases for distributed computing and the early development of Hadoop, MapReduce, and HDFS.
  • Demonstrate knowledge of the MapReduce programming model.
  • Demonstrate knowledge of statistical modelling and machine learning algorithms and how they can be applied to scalable data science problems.
  • Demonstrate knowledge of applications involving scalable data science in general.
  • Implement basic data processing using the MapReduce programming model.
  • Implement basic data analysis in Spark using the Spark Python API and the Spark SQL API.
  • Develop data analysis and modelling pipelines using Spark.
  • Develop practical solutions for real-world data science problems that require data analysis, statistical modelling, and machine learning.

Prerequisites

Subject to approval of the Head of Department of Mathematics and Statistics.

Timetable 2024

Students must attend one activity from each section.

Lecture A
Activity  Day        Time           Location                          Weeks
01        Monday     09:00 - 10:00  Jack Erskine 031 Lecture Theatre  15 Jul - 25 Aug; 9 Sep - 20 Oct

Lecture B
Activity  Day        Time           Location                          Weeks
01        Wednesday  09:00 - 11:00  A4 Lecture Theatre                15 Jul - 25 Aug; 9 Sep - 20 Oct

Computer Lab A
Activity  Day        Time           Location                          Weeks
01        Thursday   15:00 - 16:00  Jack Erskine 035 Lab 2            15 Jul - 25 Aug; 9 Sep - 20 Oct
02        Friday     12:00 - 13:00  Jack Erskine 035 Lab 2            15 Jul - 25 Aug; 9 Sep - 20 Oct
03        Friday     10:00 - 11:00  Jack Erskine 035 Lab 2            15 Jul - 25 Aug; 9 Sep - 20 Oct
04        Thursday   14:00 - 15:00  Jack Erskine 035 Lab 2            15 Jul - 25 Aug; 9 Sep - 20 Oct

Course Coordinator

James Williams

Textbooks / Resources

No textbook required.

Indicative Fees

Domestic fee $1,110.00

* All fees are inclusive of NZ GST or any equivalent overseas tax, and do not include any programme level discount or additional course-related expenses.

For further information see Mathematics and Statistics.
