Multimedia Analytics

The Multimedia Analytics group at XRCI is focused on addressing challenges in understanding, mining and utilizing large multimedia collections to solve real-world problems. The group draws on its expertise in computer vision, image processing, speech analysis, text analytics and large-scale system design to build solutions in domains such as education and healthcare. Research in this direction includes:
• creating new machine learning tools to bridge the semantic gap between raw visual information and meaningful concepts
• scaling computer vision approaches to billions of multimedia documents
• exploring new approaches to fuse information from diverse sources, such as audio, video and text
• designing systems that allow users to easily browse, search and retrieve the most relevant multimedia content

The members of the Multimedia Analytics group publish regularly at top-tier conferences and in journals. Senior members help organize technology conferences and workshops, in addition to serving on several program committees. The group also collaborates closely with PARC, PARC-East and XRCE.

Project Themes

  • Education
  • Deep Learning
  • Weakly Supervised Learning
  • Making analytics simpler to use

Analytics for Education

Quality education is one of the pressing needs of emerging markets, particularly India. XRCI believes that technology-enabled Massive Open Online Courses (MOOCs) and Open Education Resources (OERs) can be used to provide a personalized educational experience based on students’ backgrounds, learning behaviour and performance. XRCI is working towards building personalized recommendation systems that automatically create such customized video and/or text-based content.

XRCI’s ed-tech solutions are pilot ready. Please write to om.deshmukh@xerox.com to discuss pilot opportunities in further detail.

Some of the specific research problems are:

Instructional videos as the next-gen textbooks: Instructional videos are set to become the next-generation textbooks. Humans are efficient at non-linearly skimming through a textbook to get a sense of how its concepts unfold; textbooks also provide a ‘table of contents’ and an ‘index’ of key concepts for easy, to-the-point navigation. Video content, by its very nature, is much harder to skim and consume in this way. On the other hand, video lectures carry two extra channels beyond the lingual information: paralingual and extra-lingual information. We analyse the video content to automatically generate a topic-based table of contents and a list of keywords that capture the important concepts discussed in the video. The topics in the table of contents and the keywords are hyperlinked to the video for ease of navigation. The lingual, paralingual and extra-lingual information is analysed to automatically generate video-pages analogous to the individual pages of a multi-page text document.
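
To make the keyword-extraction step concrete, the short sketch below scores each time-stamped transcript segment with TF-IDF and keeps the top terms as table-of-contents anchors. It is a minimal illustration only: the transcript segments are invented, scikit-learn is assumed, and the actual system also draws on paralingual and extra-lingual cues.

  # A toy keyword-extraction step: score time-stamped transcript segments with
  # TF-IDF and keep the highest-weighted terms as table-of-contents anchors.
  # The transcript below is invented for illustration.
  from sklearn.feature_extraction.text import TfidfVectorizer

  segments = [  # (start time in seconds, transcript text)
      (0,   "Today we introduce linear regression and the idea of a loss function."),
      (180, "The squared error loss measures how far predictions are from the targets."),
      (420, "Gradient descent minimises the loss by following the negative gradient."),
  ]

  vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
  tfidf = vectorizer.fit_transform([text for _, text in segments])  # one row per segment
  vocab = vectorizer.get_feature_names_out()

  for (start, _), row in zip(segments, tfidf.toarray()):
      top = [vocab[i] for i in row.argsort()[::-1][:3] if row[i] > 0]
      print(f"{start // 60:02d}:{start % 60:02d} -> {', '.join(top)}")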

Dynamic and personalized assessment: One of the challenges with instructional videos is that, while they provide the audio-visual feel of a classroom lecture, they lack the dynamic flavour of a classroom. Our research efforts are focused on automatically generating question-answering sessions based on the student’s learning pattern and the feedback received by the instructor. The questions can be a mix of open-ended free-text questions and Multiple Choice Questions (MCQs), the latter grounded in Item Response Theory (IRT). Our research also involves extending IRT to questions beyond single-step objective questions. Automatic evaluation of these free-form answers and quantifying student understanding is also an active area of research.
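
For readers unfamiliar with IRT, the sketch below shows the standard two-parameter logistic (2PL) model that relates a student’s ability to the probability of answering an item correctly; the ability and item parameters used here are purely illustrative and not calibrated from any of our data.

  # The two-parameter logistic (2PL) IRT model: the probability that a student
  # with ability `theta` answers correctly an item with discrimination `a` and
  # difficulty `b`. The example numbers are purely illustrative.
  import numpy as np

  def prob_correct(theta, a, b):
      return 1.0 / (1.0 + np.exp(-a * (theta - b)))

  # A fairly strong student (theta = 1.0) facing a hard, discriminating item:
  print(prob_correct(theta=1.0, a=2.0, b=1.5))  # ~0.27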

Concept Linking: Concept maps and knowledge maps, often used as learning materials, enable users to recognize important concepts and the relationships between them. Concept maps can be used to provide adaptive learning guidance for learners, such as path systems for curriculum sequencing, to improve the effectiveness of the learning process. Generating concept maps typically involves domain experts, which makes it costly. We are building a framework for discovering concepts and their relationships, such as prerequisites and relatedness, by analysing content from textual sources such as textbooks and multimedia sources such as instructional videos.
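
As a rough illustration of one heuristic for such discovery, the sketch below orders concepts by where they first appear in a (hypothetical) textbook and proposes co-occurring, earlier-introduced concepts as candidate prerequisites. The concept list, chapter texts and threshold are invented, and the actual framework also draws on multimedia sources.

  # Heuristic prerequisite mining: if two concepts co-occur and one is first
  # introduced in an earlier chapter, propose it as a candidate prerequisite.
  # The concepts and chapter texts below are invented for illustration.
  from itertools import combinations

  chapters = {
      1: "vectors and matrices ...",
      2: "matrices, eigenvalues and eigenvectors ...",
      3: "eigenvalues applied to principal component analysis ...",
  }
  concepts = ["vectors", "matrices", "eigenvalues", "principal component analysis"]

  first_chapter = {c: min(ch for ch, text in chapters.items() if c in text)
                   for c in concepts}
  edges = []
  for a, b in combinations(concepts, 2):
      cooccur = sum(1 for text in chapters.values() if a in text and b in text)
      if cooccur >= 1 and first_chapter[a] != first_chapter[b]:
          pre, post = (a, b) if first_chapter[a] < first_chapter[b] else (b, a)
          edges.append((pre, post))

  print(edges)  # e.g. [('matrices', 'eigenvalues'), ('eigenvalues', 'principal component analysis')]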

Teaching Style Analysis: Any instructional content (textbook, tutorial, video lecture) is judged not just on the content itself but also on the style of delivery. More engaging content leads to better understanding and higher retention rates. In this research, we analyse the content not only for level-of-difficulty and concept-density but also (in the case of instructional videos) for the speaking and teaching styles of the lecturer, and study how these dimensions impact learnability.

Our research is published at leading conferences such as Intelligent User Interfaces (IUI), Educational Data Mining (EDM), Interspeech, Human Computer Interaction (HCI) and the International Conference on Multimodal Interaction (ICMI).

Deep Learning

One of the challenges in visual analysis is representing images (or videos) in a manner amenable to computers learning concepts from them. Deep Learning (DL) approaches have been successful in automatically learning such robust representations. XRCI is applying DL techniques to model the variety in multimedia content, and is enhancing DL architectures to tackle specific visual problems and to extend them to the text and audio domains.
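
As a minimal illustration of what such a learned representation looks like in practice, the sketch below uses a pretrained convolutional network as a generic image feature extractor. PyTorch/torchvision is assumed, the image path is hypothetical, and this is not a description of XRCI’s own architectures.

  # Using a pretrained convolutional network as a generic image-representation
  # extractor: the classification head is removed, leaving a 512-d feature vector.
  import torch
  from PIL import Image
  from torchvision import models, transforms

  resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
  resnet.fc = torch.nn.Identity()  # drop the classifier, keep the representation
  resnet.eval()

  preprocess = transforms.Compose([
      transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
  ])

  image = Image.open("lecture_frame.jpg").convert("RGB")  # hypothetical video frame
  with torch.no_grad():
      features = resnet(preprocess(image).unsqueeze(0))   # shape: (1, 512)
  print(features.shape)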

Weakly Supervised Learning

The amount of multimedia data available online continues to grow exponentially. The recent surge in video-capturing devices has led to thousands of hours of video being captured every day. Automatic analysis of this large amount of content is a challenging research problem, but it has a wide range of applications, e.g. deep semantic analysis for summarization, saliency detection and event recognition, across business domains such as education, customer care, healthcare and transportation.

Video data usually has two major components, audio and visual, both of which can in turn be interpreted as text. Analysing these components simultaneously and in an inter-connected fashion leads to a deep and thorough semantic understanding of the content. Another challenging aspect of audio-visual analysis is obtaining labelled data: labelling several hours of video and learning from it is an extremely tedious task. Developing algorithms that learn from weak supervision is therefore an extremely important research problem.

Multimedia Analytics researchers are building novel weakly supervised learning algorithms in this area.
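
One common weak-supervision strategy is multiple-instance learning, where only a video-level label is available and per-segment predictions are pooled before computing the loss. The toy sketch below (PyTorch assumed; the feature dimensions and scoring network are illustrative) shows the idea, and is not a description of the specific algorithms under development.

  # Multiple-instance learning with a weak, video-level label: per-segment
  # scores are max-pooled into one video score before computing the loss.
  # Feature dimensions and the tiny scoring network are illustrative only.
  import torch
  import torch.nn as nn

  segment_features = torch.randn(40, 512)   # 40 segments of one video, 512-d each
  video_label = torch.tensor([1.0])         # weak label: "the event occurs somewhere"

  scorer = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 1))
  segment_scores = scorer(segment_features).squeeze(-1)   # one score per segment

  video_score = segment_scores.max().unsqueeze(0)          # MIL pooling over segments
  loss = nn.functional.binary_cross_entropy_with_logits(video_score, video_label)
  loss.backward()   # gradients still reach the per-segment scorer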

Making analytics simpler to use

At Xerox Research, an easy-to-use, scalable and efficient mechanism for performing data analysis, providing insights and answering business intelligence questions is important. Such a service would be valuable both to experts, by automating much of their work, and to non-experts, by letting them solve problems without having to gain analytics expertise. The system internally calculates the hardware and software required to optimally solve the problem within the given cost and time constraints. The results of the analysis can then be easily interpreted by an expert to provide a recommendation that improves the customer's business.
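
As a simplified illustration of this kind of resource selection, the sketch below picks the cheapest cluster configuration whose estimated runtime fits the user's deadline and budget; the configurations, cost rates and throughput model are invented and not those of the actual system.

  # Pick the cheapest cluster configuration whose estimated runtime meets the
  # user's deadline and budget. Configurations and throughputs are invented.
  configs = [
      {"name": "2 nodes",  "cost_per_hour": 4.0,  "gb_per_hour": 50},
      {"name": "8 nodes",  "cost_per_hour": 16.0, "gb_per_hour": 190},
      {"name": "32 nodes", "cost_per_hour": 64.0, "gb_per_hour": 700},
  ]

  def cheapest_config(data_gb, deadline_hours, budget):
      feasible = []
      for c in configs:
          hours = data_gb / c["gb_per_hour"]
          cost = hours * c["cost_per_hour"]
          if hours <= deadline_hours and cost <= budget:
              feasible.append((cost, hours, c["name"]))
      return min(feasible) if feasible else None   # cheapest feasible option

  print(cheapest_config(data_gb=500, deadline_hours=4, budget=100))  # picks "8 nodes"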