What's Going On: Teaching Data Science with Data Science

March 28, 2017

Call it (data science)². In a February 2017 seminar, Jose Luis Ambite described a project that uses data science (a concept unifying statistics, data analysis, related methods and their results) to teach biomedical researchers about data science.

The goal: to create automated, personalized "knowledge maps" that display the specific learning resources each student will need to follow to achieve mastery of specific subjects. Programs are customized based on each user's profile, existing skills and knowledge. Ongoing updates will ensure underlying materials are refreshed continually, delivering access to state-of-the-art knowledge. In contrast, traditional education often relies on preselected materials throughout a course.

Ambite's talk to about 30 colleagues in Marina del Rey, California and Arlington, Virginia was the latest in ISI's "What's Going On" series, which deepens researchers' knowledge of work taking place Institute-wide. The data science initiative, known as the Educational Resource Discovery Index for Data Science (ERuDIte), also involves ISI researchers Lily Fierro, Kristina Lerman, Gully Burns, and Jonathan Gordon.

ERuDIte's approach begins with collecting numerous high-quality data science resources available online. Those resources range from single lectures to entire courses, in text and video formats. Sources include major massive open online course (MOOC) providers such as Coursera, Udacity and edX, along with tutorials and talks from videolectures.net, YouTube and other sources. Materials structured in a logical order for each subject, like a book table of contents or course syllabus, also are scraped from the web.

The system then describes and organizes each resource. ERuDIte now contains descriptions of 8,800 learning resources on data science, with data on those resources harmonized into a common schema.
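
As an illustration only, harmonizing records from different sources into one schema might look like the Python sketch below. The field names, the two source types and the Resource class are all hypothetical; the article does not describe ERuDIte's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Common schema (hypothetical) shared by every learning resource."""
    title: str
    url: str
    provider: str
    media: str  # e.g. "video" or "text"


# Hypothetical mappings: common field name -> field name used by each source type.
FIELD_MAPS = {
    "mooc": {"title": "course_name", "url": "link", "media": "format"},
    "video_site": {"title": "name", "url": "watch_url", "media": "kind"},
}


def harmonize(raw: dict, source: str, provider: str) -> Resource:
    """Translate one source-specific record into the common schema."""
    mapping = FIELD_MAPS[source]
    return Resource(
        title=raw[mapping["title"]].strip(),   # trim stray whitespace
        url=raw[mapping["url"]],
        provider=provider,
        media=raw[mapping["media"]].lower(),   # normalize case
    )
```

Once every record has the same fields, later steps such as tagging and recommendation can treat a MOOC course and a single video lecture alike.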

Next, resources are described with a taxonomy of about 120 concepts organized along six hierarchical dimensions. (That taxonomy was generated semi-automatically by extracting key phrases from the resources and by applying unsupervised topic modeling.) Machine learning and information retrieval techniques automatically tag each resource with concepts from the taxonomy. Results of the web-based system, which is 73 to 83 percent accurate, are reviewed by human curators to ensure high quality. Ultimately, a recommendation engine will provide a personal learning path for each student.
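
To make the tagging step concrete, here is a minimal Python sketch of one simple information retrieval approach: scoring a resource description against each concept by bag-of-words cosine similarity. It is a stand-in, not ERuDIte's method. The three concepts, their keyword lists and the 0.2 threshold are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical mini-taxonomy: concept name -> keywords describing it.
CONCEPTS = {
    "regression": "linear regression least squares model fitting",
    "clustering": "clustering k-means unsupervised groups",
    "visualization": "plots charts visualization graphics",
}


def tokens(text: str) -> list:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def tag(description: str, threshold: float = 0.2) -> list:
    """Return concepts whose similarity meets the threshold, best match first."""
    doc = Counter(tokens(description))
    scores = {c: cosine(doc, Counter(tokens(kw))) for c, kw in CONCEPTS.items()}
    return sorted(
        (c for c, s in scores.items() if s >= threshold),
        key=lambda c: -scores[c],
    )
```

For instance, tag("Fitting a linear regression model with least squares") returns ["regression"]. In a pipeline like the one described, tags proposed this way would still go to human curators for review.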

Ambite's inspiration for the knowledge map was the Khan Academy, a popular educational website in which a team recommends its own online videos for kindergarten-to-college-level students. The Khan Academy manually constructs a map to show each video lecture's dependencies on other lectures, a labor-intensive process that clearly won't scale to web proportions. Automating the approach not only makes ERuDIte massively scalable, says Ambite, but the system easily could be adapted to many other data- and resource-intensive technical realms beyond its initial data science target.
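
A knowledge map of this kind can be thought of as a dependency graph. The Python sketch below assumes a small, hand-written map (the topics and their prerequisites are invented for illustration) and uses the standard library's graphlib module, available in Python 3.9 and later, to order the topics a student still needs. It also assumes that a student who already knows a topic knows that topic's prerequisites.

```python
from graphlib import TopologicalSorter

# Hypothetical knowledge map: topic -> topics it depends on.
PREREQS = {
    "probability": set(),
    "linear algebra": set(),
    "statistics": {"probability"},
    "regression": {"statistics", "linear algebra"},
}


def learning_path(target: str, known=frozenset()) -> list:
    """Order the topics needed to reach `target`, skipping known ones."""
    # Walk backwards from the target to collect every topic still needed.
    needed = set()
    stack = [target]
    while stack:
        topic = stack.pop()
        if topic in known or topic in needed:
            continue
        needed.add(topic)
        stack.extend(PREREQS.get(topic, ()))
    # Keep only dependencies among needed topics, then sort them topologically
    # so that every prerequisite appears before the topics that rely on it.
    graph = {t: {p for p in PREREQS.get(t, ()) if p in needed} for t in needed}
    return list(TopologicalSorter(graph).static_order())
```

Calling learning_path("regression", known={"probability"}) yields a path that omits probability and ends with regression. The order of independent topics, such as statistics and linear algebra here, is not fixed.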

After walking through each step in detail, Ambite closed with ERuDIte's current, three-pronged direction. First, the team is collecting more content such as slides and video transcripts, which includes deploying speech recognition to transcribe videos. Second, they're working on automatically identifying high-quality data science videos from general sources, such as YouTube. And third, they're seeking to identify subject prerequisites automatically by exploiting the organization of existing academic resources.
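
One simple way to picture the third direction is to treat the order of topics in existing syllabi as evidence of prerequisites. The Python sketch below is an assumption-laden illustration, not the team's method: it proposes that topic A is a prerequisite of topic B when A precedes B in at least min_support syllabi and B never precedes A. The function name, the threshold and the sample syllabi are hypothetical, and each topic is assumed to appear at most once per syllabus.

```python
from collections import Counter
from itertools import combinations


def infer_prerequisites(syllabi, min_support=2):
    """Propose (prerequisite, topic) pairs from the ordering of topics in syllabi."""
    before = Counter()
    for syllabus in syllabi:
        # combinations() preserves list order, so `earlier` always precedes `later`.
        for earlier, later in combinations(syllabus, 2):
            before[(earlier, later)] += 1
    # Keep pairs seen often enough and never observed in the opposite order.
    return sorted(
        (a, b)
        for (a, b), n in before.items()
        if n >= min_support and before[(b, a)] == 0
    )


syllabi = [
    ["probability", "statistics", "regression"],
    ["probability", "regression", "statistics"],
    ["probability", "statistics"],
]
print(infer_prerequisites(syllabi))
# [('probability', 'regression'), ('probability', 'statistics')]
```

Statistics and regression appear in both orders across the sample syllabi, so no dependency is proposed between them.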

Such a living, non-traditional educational resource will be ideally suited to teaching fast-moving technical subjects. Says Ambite, "All the resources are continually updated, so that the learning experience is always right on the cutting edge."

ERuDIte is funded by the National Institutes of Health's "Big Data to Knowledge" (BD2K) initiative in biomedical data science research, methods and workforce development. ERuDIte is a core effort of the BD2K Training Coordinating Center led by Keck School of Medicine of USC associate professor Jack Van Horn. Other collaborators include the USC Stevens Neuroimaging and Informatics Institute, USC School of Cinematic Arts and Georgetown University, along with the training component of Elixir, a large pan-European effort on integrating life-science data across Europe.