Machine Learning and Statistics

A part of the Data Analytics lab, the Machine Learning and Statistics group looks at new machine learning algorithms and statistical techniques for predictive analytics. The group aims to understand large amounts of unstructured data generated across diverse Xerox Services businesses to derive new actionable insights.

The research group publishes regularly in top conferences in machine learning and statistics, as well as in domain-specific conferences in areas such as healthcare and transportation. The group also has several collaborative projects with academia in India and abroad. A sampling of the projects that the group is working on is presented below:

Project Themes

  • Healthcare
  • Transportation
  • Making Analytics Simpler


Analytics for healthcare

The availability of digitized clinical data through Electronic Medical Records (EMR) in hospitals is increasing throughout the world. This data presents an unprecedented opportunity to study and gain a deeper understanding of diseases, develop new treatments and improve healthcare ecosystems. There is also tremendous interest in identifying high-risk individuals early, preferably well before the onset of disease, in order to provide preventive care and considerably reduce the clinical and economic burden of healthcare.

The aim of our project is to develop statistical models and algorithms for predictive analytics that can identify high-risk patients in hospitals. Under this broad umbrella, we have been working on specific problems, which include:

- ICU admission prediction, where the aim is to identify patients on the hospital floor who are at risk of deterioration and potential ICU admission
- Complication prediction in ICUs for specific complications like Acute Hypotensive Episodes, Acute Respiratory Failure
- ICU mortality prediction to predict the risk of death in ICU patients
- Stroke severity and outcome prediction for automatically distinguishing between mild and severe cases of stroke, and for effective personalized treatment planning

Identification of high-risk patients (e.g. in each of the above cases) can potentially lead to prioritized care, better hospital resource management and reduced healthcare costs. We collaborate with two Indian hospitals for data acquisition as well as consultation with doctors. In addition, we have an ongoing collaboration with Midas+, Xerox’s healthcare provider services group.

Healthcare data is usually a heterogeneous mix from several sources. Clinical data includes measurements of vital parameters like blood pressure, respiration rate, heart rate and temperature. These are typically measured periodically and thus can be viewed as time series data. Laboratory investigations are numerical measurements of the levels of important biomarkers. Radiology investigations are available as images accompanied by text reports. Nursing notes are a rich source of information – both subjective and objective – about a patient’s condition. Information about medications, past conditions, comorbidities, addictions and socio-economic factors such as income and education level also provides vital clues about the patient.

Healthcare data presents several modelling and algorithmic challenges. Vital measurements are typically not uniformly sampled, either within a single patient's record or across patients. Effective time series modelling under such conditions remains an open problem in machine learning. New techniques are required to make better use of data from heterogeneous sources. We have explored new convex matrix factorization and copula-based models to handle such heterogeneity. Uncertainty and noise in the data lead to inaccurate models that can result in erroneous decision making. Hence there is a need for robust classification models that can handle such data with known guarantees. Another major challenge in most healthcare datasets is data imbalance, which results in classification models that are severely biased and unable to identify test samples from the under-represented classes. We have developed new models and algorithms that can address many of these problems.
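To make the data imbalance problem concrete, here is a minimal sketch of one common baseline remedy, inverse-frequency class weighting. It is not the group's own method, which is not described here. The toy cohort, the function names and the predicted probabilities are all hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Weighted binary cross-entropy; probs are predicted P(label == 1)."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical toy cohort: 1 = adverse event (rare), 0 = no event.
labels = [0] * 18 + [1] * 2
weights = balanced_class_weights(labels)

# A classifier that always predicts "low risk" scores well on the
# unweighted loss but is penalised once the rare class is up-weighted.
always_low = [0.05] * len(labels)
unweighted = weighted_log_loss(labels, always_low, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_low, weights)
```

In this sketch the rare class receives a weight nine times that of the common class, so a model that ignores it incurs a noticeably higher loss. Class weighting is only one option; resampling and cost-sensitive learning are alternatives.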


Analytics for transportation

Public transportation networks are becoming increasingly intelligent through extensive automation. This automation generates data from the daily operations of the transportation network, such as commuters' smart card usage or GPS traces of buses running in the network. At XRCI, we mine this trove of data to produce meaningful insights into the travel patterns of commuters and the service levels of the transportation network. We also invent new technology at the interface of combinatorial optimisation and machine learning to produce next-generation scheduling techniques that can optimise the operations of the transportation network.

Research threads:

1. Probabilistic graphical models to learn complex interdependency structures in the transportation domain.
2. High-dimensional regression to learn functions in big data settings with an imbalance between the number of features and the number of data points.
3. Markov decision processes to help commuters make informed decisions about their daily commute.
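As an illustration of the third thread, the sketch below runs value iteration on a tiny commute modelled as a Markov decision process. The states, actions, probabilities and travel times are hypothetical and are not drawn from any XRCI dataset. The code shows only the general technique, not the group's actual models.

```python
def value_iteration(transitions, terminal, gamma=1.0, tol=1e-9, max_iter=1000):
    """Minimise expected travel time.

    transitions[state][action] is a list of (prob, next_state, cost_minutes).
    """
    values = {s: 0.0 for s in transitions}
    values.update({s: 0.0 for s in terminal})

    def expected_cost(outcomes):
        return sum(p * (cost + gamma * values[nxt]) for p, nxt, cost in outcomes)

    for _ in range(max_iter):
        delta = 0.0
        for state, actions in transitions.items():
            best = min(expected_cost(outcomes) for outcomes in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = {
        state: min(actions, key=lambda a: expected_cost(actions[a]))
        for state, actions in transitions.items()
    }
    return values, policy


# Hypothetical commute: costs are travel times in minutes.
commute = {
    "home": {
        "bus": [(0.8, "work", 40.0), (0.2, "work", 60.0)],   # expected 44
        "walk_to_metro": [(1.0, "station", 10.0)],
    },
    "station": {
        "metro": [(0.9, "work", 25.0), (0.1, "work", 45.0)],  # expected 27
        "taxi": [(1.0, "work", 35.0)],
    },
}

values, policy = value_iteration(commute, terminal={"work"})
# policy["home"] is "walk_to_metro": 10 + 27 = 37 expected minutes beats 44 by bus.
```

With these made-up numbers the recommended plan is to walk to the metro rather than take the direct bus, because the expected total time is lower even though it involves a transfer.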


Making analytics simpler to use

At Xerox Research, we consider it important to have an easy-to-use, scalable and efficient mechanism for data analysis that provides insights and answers to business intelligence questions. Such a service would be valuable to both experts – by automating much of their work – and non-experts – by providing a mechanism to solve problems without having to gain analytics expertise. The system internally calculates the hardware and software required to optimally solve the problem within the given cost and time constraints. The results from the analysis can be easily interpreted by an expert to provide a recommendation to improve the customer’s business.
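The resource-selection step described above can be pictured as a small constrained choice. The sketch below is an assumption about what such a step might look like, not a description of the actual system. The candidate setups, their time and cost estimates and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical label for a hardware/software setup
    est_hours: float      # estimated time to finish the analysis
    cost_per_hour: float  # estimated cost of running the setup

    @property
    def total_cost(self) -> float:
        return self.est_hours * self.cost_per_hour


def cheapest_feasible(
    candidates: Iterable[Candidate], max_hours: float, max_cost: float
) -> Optional[Candidate]:
    """Return the cheapest setup meeting both limits, or None if none does."""
    feasible = [
        c for c in candidates
        if c.est_hours <= max_hours and c.total_cost <= max_cost
    ]
    # Break cost ties in favour of the faster setup.
    return min(feasible, key=lambda c: (c.total_cost, c.est_hours), default=None)


candidates = [
    Candidate("single-node", est_hours=10.0, cost_per_hour=1.0),     # cost 10
    Candidate("small-cluster", est_hours=3.0, cost_per_hour=4.0),    # cost 12
    Candidate("large-cluster", est_hours=1.0, cost_per_hour=20.0),   # cost 20
]

choice = cheapest_feasible(candidates, max_hours=4.0, max_cost=15.0)
no_choice = cheapest_feasible(candidates, max_hours=0.5, max_cost=15.0)
```

Here the single-node setup is cheapest but too slow and the large cluster is fast but too expensive, so the small cluster is selected. When no setup satisfies both limits the function returns None, so the caller can relax a constraint.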


The Analytics lab at XRCI collaborates with universities through internship programs, faculty sabbaticals, formal joint collaboration projects and pilots. XRCI is currently working closely with a number of academic institutes, including Indian Institute of Science, Indian Institute of Technology (Delhi & Bombay), University of Helsinki, University of Texas at Austin, Massachusetts Institute of Technology (MIT) and Sai Vidya Institute of Technology (SVIT). These projects are in areas such as machine learning, speech processing, semantic analysis and cloud computing, and include piloting of XRCI’s novel education solutions. XRCI also collaborates with hospitals and medical institutes such as Manipal Kasturba University, St. John’s Medical College and cancer hospitals to develop realistic algorithms for disease diagnostics.