COVID-19 and machine learning
(appeared in May 2020)

With current clinical and computational ability, we should do better than with earlier pandemics, says S. Ananthanarayanan.

How the world is dealing with the pandemic has many dimensions. Apart from an advanced understanding of the interior of cells and the mechanism of viral action, we now have sophisticated microscopic imaging, powerful computing and instant communication for widespread cooperation.

The other modern tool that we have is machine learning – or the use of computers to analyse data, where the computers teach themselves to improve the accuracy with which they discern patterns in the data. Machine learning uses the ability of computers to carry out massive computations to imitate the way neuron circuits in animal brains adapt and train themselves for sensitive pattern recognition.

A paper in the journal, Nature Machine Intelligence, by Li Yan, Hai-Tao Zhang, Jorge Goncalves, Yang Xiao, Maolin Wang, Yuqi Guo, Chuan Sun, Xiuchuan Tang, Liang Jing, Mingyang Zhang, Xiang Huang, Ying Xiao, Haosen Cao, Yanyan Chen, Tongxin Ren, Fang Wang, Yaru Xiao, Sufang Huang, Xi Tan, Niannian Huang, Bo Jiao, Cheng, Yong Zhang, Ailin Luo, Laurent Mombaerts, Junyang Jin, Zhiguo Cao, Shusheng Li, Hui Xu and Ye Yuan, from different departments of the Tongji Medical College and the Schools of AI, Engineering and Information Science of the Huazhong and Wuhan Universities of Science and Technology, Wuhan, China, the Centre for Systems Biomedicine, Luxembourg, and the University of Cambridge, describes a method of early and accurate assessment of the course a case of COVID-19 would take. Such assessment helps optimize the use of available facilities by speedily segregating persons who test positive into groups that need different levels of care.

A simple application of machine learning is regression, or prediction based on past trends. An example, from an online presentation of machine learning by Stanford University, is estimating the price of a house from data about the floor area, number of rooms, bathrooms, and so on. The ‘learning data’ is real information collected by a survey of houses that have been bought or sold. Considering only the covered area, the data for a sample of six houses could be like in Table 1. The same data is shown in the graph, where the prices are plotted against the areas.

The line that passes through the trend shown by the points could then be used to indicate nearly the correct price of a seventh house, either for the seller to decide what price to ask or for the buyer to decide what price is reasonable.

Machine learning uses a formal method to work things out from data like this. The price of a house is taken to be some base price plus an amount that depends on the area, like this:

Price = Base + Rate × Area

Now, the idea is to discover the values of ‘Base’ and ‘Rate’ that best fit the data that we have. This is done by first working out a ‘cost function’. For assumed values of ‘Base’ and ‘Rate’, the cost is the difference between the price each house should have, according to the formula, and its actual price. The total of these differences (usually their squares, so that over-estimates and under-estimates do not cancel out), over all the houses, gives us the cost function. Now is when the computer gets active. It rapidly works out the values of the cost function as ‘Base’ and ‘Rate’ are varied. Different values are tried out, till we arrive at the lowest value of the cost function. These are then the values of ‘Base’ and ‘Rate’ that most closely match the available data.
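
As an illustration, here is a minimal sketch of this search in Python. The six area and price figures are made up to stand in for Table 1, and the grid of trial values is an arbitrary choice; the point is only to show the cost function being driven down.

import numpy as np

# Made-up learning data standing in for Table 1: covered area and sale price.
areas = np.array([750.0, 900.0, 1100.0, 1300.0, 1500.0, 1800.0])   # square feet (invented)
prices = np.array([60.0, 72.0, 85.0, 98.0, 112.0, 130.0])          # in lakhs (invented)

def cost(base, rate):
    """Total of squared differences between predicted and actual prices."""
    predicted = base + rate * areas
    return np.sum((predicted - prices) ** 2)

# Try out many values of 'Base' and 'Rate', keeping the pair with the lowest cost.
best = None
for base in np.linspace(0.0, 20.0, 81):
    for rate in np.linspace(0.0, 0.2, 201):
        c = cost(base, rate)
        if best is None or c < best[0]:
            best = (c, base, rate)

lowest_cost, base, rate = best
print(f"Base = {base:.2f}, Rate = {rate:.4f}, cost = {lowest_cost:.2f}")
print("Estimated price of a 1,000 sq ft house:", round(base + rate * 1000, 1))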

The formula can then predict the price of the seventh, eighth and later houses. When these houses are actually sold, and the real values are known, they can be added to the table and the estimate improved, till the formula becomes stable and reliable.

When more factors are taken into account, like the number of rooms or bathrooms, the price can again be worked out by assuming values for each factor, and the best fit with actual data discovered by calculating the ‘cost function’ and finding the optimum ‘factors’.
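
A short sketch of the same idea with more than one factor, again with invented figures, could use scikit-learn's LinearRegression to find the best-fitting base and ‘rates’ for area, rooms and bathrooms all at once.

from sklearn.linear_model import LinearRegression

# Invented learning data: covered area, number of rooms, number of bathrooms.
features = [[750, 2, 1], [900, 2, 1], [1100, 3, 2],
            [1300, 3, 2], [1500, 4, 2], [1800, 4, 3]]
prices = [60, 72, 85, 98, 112, 130]   # in lakhs (invented)

model = LinearRegression().fit(features, prices)
print("Base:", round(model.intercept_, 2))
print("Rates for area, rooms, bathrooms:", model.coef_.round(3))
print("Estimated price of a 1,000 sq ft, 3-room, 2-bath house:",
      round(model.predict([[1000, 3, 2]])[0], 1))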

Another kind of problem that machine learning deals with is ‘classification’. The example used in the Stanford presentation is of classifying tumours as malignant or benign. The example uses the size of the tumour as the relevant feature. The data then consists of tumours of different sizes, and whether they turned out to be malignant or benign. Here, the answer that we are seeking is not one that can take all values, like the price, but just a Yes or a No – a classification of tumours based on size.

The data and graph would be like in Table 2.

In practice, of course, whether a tumour is malignant depends on many factors, like the features that decide the price of a house. The ‘value’ derived from the complex data is expressed in a way that it is not a continuous number, but either ‘1’ or ‘0’, for ‘true’ or ‘false’, and a ‘cost function’, of how far predictions are from facts, is similarly worked out, to arrive at a ‘decision boundary’ that can alert the physician if there are further tests she needs to do.
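
A minimal sketch of such a classification, using logistic regression from scikit-learn on invented tumour sizes and labels (not the data of the Stanford example), would look like this; the fitted model gives the decision boundary directly.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented learning data: tumour size in cm, and whether it proved malignant.
sizes = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
malignant = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 1 = malignant, 0 = benign

model = LogisticRegression().fit(sizes, malignant)

# The decision boundary is the size at which the predicted probability crosses 0.5.
boundary = -model.intercept_[0] / model.coef_[0][0]
print(f"Decision boundary at roughly {boundary:.2f} cm")
print("Prediction for a 2.2 cm tumour:", model.predict([[2.2]])[0])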

The animal brain also does complex classification, but the method is not to work out a ‘cost function’; it is to strengthen or weaken responses to stimuli, depending on the experience of outcomes. For instance, if a bird finds that a light colour and a grainy feel have meant an edible tidbit on many occasions, the response of pecking is strengthened. But if the same colour with a shiny feel was a pebble, the bird learns not to peck. The process is simulated in the computer, with the probability of a classification increased when combinations of features led to correct results, and the method is able to quickly become very expert. Examples are computers driving cars in traffic or beating grandmasters at chess.
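
In the spirit of the pecking bird, a toy version of this can be written as a single artificial ‘neuron’ whose weights for two features, lightness of colour and graininess of feel, are strengthened or weakened after each outcome; the numbers below are invented purely for illustration.

import numpy as np

# Each trial: [lightness of colour, graininess of feel]; label 1 = edible, 0 = pebble.
trials = np.array([[0.9, 0.8], [0.8, 0.9], [0.9, 0.1], [0.7, 0.2]])
edible = np.array([1, 1, 0, 0])

weights = np.zeros(2)
bias = 0.0
step = 0.1

for _ in range(20):                          # repeated experience
    for x, outcome in zip(trials, edible):
        pecks = 1 if weights @ x + bias > 0 else 0
        error = outcome - pecks              # +1: should have pecked, -1: should not have
        weights += step * error * x          # strengthen or weaken the response
        bias += step * error

print("Learned weights:", weights, "bias:", bias)
print("Peck at a light, grainy object?", int(weights @ [0.9, 0.85] + bias > 0))
print("Peck at a light, shiny object? ", int(weights @ [0.9, 0.10] + bias > 0))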

The group of scientists working at Wuhan have used methods like this to study a sample of 375 patients who tested positive for COVID-19. The initial symptoms were fever, cough, fatigue and breathing problems. The 75 features considered included basic information, symptoms, blood samples and the results of laboratory tests, including liver function, kidney function, coagulation function, electrolytes and inflammatory factors. Against these features, recorded early in the infection, was the final result – recovery or mortality. Machine learning used the data to develop a scheme that identified the crucial biomarkers associated with the most serious prognosis. The scheme was then applied to 110 fresh instances of patients and the results were found to be more than 90% accurate.
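
As a rough illustration of such a pipeline (not the authors' code, and with data generated at random rather than real patient records), one could train a decision-tree classifier on a table of features, keep 110 cases aside for testing, and see which features the tree leans on most.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the patient data: 375 cases, 75 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(375, 75))                       # lab results, symptoms, etc. (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=375) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=110, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Accuracy on the held-out cases:", round(tree.score(X_test, y_test), 2))

# The most informative features play the role of the crucial biomarkers.
top = np.argsort(tree.feature_importances_)[::-1][:3]
print("Most informative feature indices:", top)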

The results, in short, are that there are three vital factors: levels of (i) lactic dehydrogenase (LDH), which reflects tissue damage, (ii) lymphocytes, a type of white blood cell, and (iii) high-sensitivity C-reactive protein, hs-CRP, which reflects the state of inflammation. Identifying these factors gives the medical team advance pointers of what procedures to adopt, helping both in prioritizing patients and in giving them the treatment that is most likely to help.
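
To see how such pointers could be turned into a bedside check, here is a sketch of a simple three-marker rule; the cut-off values are placeholders chosen for illustration, not the thresholds reported in the paper.

def risk_flag(ldh, lymphocyte_pct, hs_crp,
              ldh_cut=350.0, lymph_cut=15.0, crp_cut=40.0):
    """Flag a patient as high risk if the markers cross (hypothetical) cut-offs."""
    if ldh > ldh_cut:
        return "high risk"
    if hs_crp > crp_cut and lymphocyte_pct < lymph_cut:
        return "high risk"
    return "lower risk"

print(risk_flag(ldh=420, lymphocyte_pct=10, hs_crp=60))   # high risk
print(risk_flag(ldh=200, lymphocyte_pct=30, hs_crp=5))    # lower risk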

The significance of the work is twofold, the paper says. Apart from pointing out the high-risk factors, it “provides a simple and intuitive clinical test to precisely and quickly quantify the risk that the patient faces.”

------------------------------------------------------------------------------------------
Do respond to: response@simplescience.in
-------------------------------------------