Teaching Data Science to Non-Programmers
I developed a new course to teach data science to non-programmers. It is offered as DSCI 549 in the USC Data Science program and is attended by graduate students majoring in geography, political science, communications, journalism, business, and education.
The publications and materials below give an overview of how the course was designed and structured.
Wood figurine of a computer operator. Yoruba people, Nigeria. Carved from the long brown and white thorns of the trunk of a silk cotton tree. Fitchburg Art Museum, 2017.
The course syllabus is available here.
• Teaching Big Data Analytics Skills with Intelligent Workflow Systems. Yolanda Gil. Proceedings of the Sixth Symposium on Educational Advances in Artificial Intelligence (EAAI), colocated with the National Conference of the Association for the Advancement of Artificial Intelligence (AAAI), Phoenix, AZ, 2016.
• Teaching Parallelism Without Programming: A Data Science Curriculum for Non-CS Students. Yolanda Gil. Proceedings of the Workshop on Education for High-Performance Computing (EduHPC), held in conjunction with the IEEE ACM International Conference on High Performance Computing (SC), New Orleans, LA, 2014.
There is a huge demand for learning data science, which has emerged as a widely desirable skill in many areas. Although courses are now available on a variety of aspects of data analytics and big data, they require programming skills, which many students lack. As a result, acquiring practical data analytics skills is out of reach for many students and professionals, particularly in the humanities and soft sciences. This severely limits our ability as a society to take advantage of our vast digital data resources.
Our goal is for students with no programming background to learn basic concepts of data science, so that they can understand how to pursue data-driven research projects in their area and be in a better position to collaborate with computer scientists on such projects. Although it is always beneficial to learn programming, not every student is inclined to invest the time and effort to do so. We also believe that many students will be motivated to take a programming class after taking this course.
Our focus has been to design a curriculum that teaches computing concepts above the level of particular programming languages and implementations. In addition, we teach a broader set of data science skills than other courses, including not only machine learning and databases but also parallel and distributed computing, semantics and metadata, and provenance. There is also more emphasis on end-to-end methods for data analysis, which include data pre-processing, data post-processing, and visualization. Students learn how to analyze different kinds of data, including text, images, spatio-temporal data, and network data.
Learning these concepts must be supplemented with practice. But how can students with no programming skills see programs in action? A major component of the course is the use of an intelligent workflow system that enables students to practice complex data analysis concepts. We capture common analytic methods as multi-step computational workflows, which students use to practice with real-world datasets within pre-defined lesson units. The workflows include semantic constraints that the system uses to help students set up parameters correctly and to validate their workflows. This enables students to learn in the context of real-world, science-grade datasets and data analytics methods.
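To make the idea of a workflow with semantic constraints more concrete, here is a minimal Python sketch. It is an illustration only, not the workflow system used in the course: the `Step` class, the `validate` function, the step names, the data type labels, and the parameter ranges are all hypothetical. It assumes that each step declares the kind of data it consumes and produces, plus an allowed range for each parameter, and that validation means checking those declarations before anything runs.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a hypothetical analysis workflow."""
    name: str
    consumes: str                                  # kind of data the step expects
    produces: str                                  # kind of data the step emits
    params: dict = field(default_factory=dict)     # values chosen by the student
    allowed: dict = field(default_factory=dict)    # parameter -> (low, high)


def validate(workflow):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    # Constraint 1: every constrained parameter is set and within its range.
    for step in workflow:
        for key, (low, high) in step.allowed.items():
            value = step.params.get(key)
            if value is None:
                problems.append(f"{step.name}: parameter '{key}' is not set")
            elif not low <= value <= high:
                problems.append(
                    f"{step.name}: '{key}'={value} is outside [{low}, {high}]"
                )
    # Constraint 2: each step must produce what the next step consumes.
    for prev, nxt in zip(workflow, workflow[1:]):
        if prev.produces != nxt.consumes:
            problems.append(
                f"{prev.name} produces '{prev.produces}' "
                f"but {nxt.name} expects '{nxt.consumes}'"
            )
    return problems


# A well-formed text clustering workflow (all names are made up).
good = [
    Step("remove_stop_words", "raw_text", "clean_text"),
    Step("extract_features", "clean_text", "feature_table",
         {"max_words": 500}, {"max_words": (1, 10000)}),
    Step("cluster", "feature_table", "cluster_labels",
         {"k": 5}, {"k": (2, 50)}),
]

# A broken variant: the feature step is missing and k is too small.
bad = [
    Step("remove_stop_words", "raw_text", "clean_text"),
    Step("cluster", "feature_table", "cluster_labels",
         {"k": 1}, {"k": (2, 50)}),
]

if __name__ == "__main__":
    for label, workflow in (("good", good), ("bad", bad)):
        issues = validate(workflow)
        print(label, "is valid" if not issues else issues)
```

In this sketch the checks run before any data is processed, so a student could be told which parameter or connection to fix without having to read or write the code behind each step.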
The following lecture materials are publicly available under a Creative Commons CC-BY license:
Additional course materials (homeworks, quizzes, and tests) are available upon request (email: [email protected]).
We gratefully acknowledge support from the US National Science Foundation (NSF) with award ACI-1355475.
Last updated: January 3, 2020