There has been ample coverage of the fact that the coronavirus currently disrupting the world as we knew it was first spotted by a machine learning algorithm designed to do just that: pick up on virus outbreaks and other possible threats. The story was originally covered by this Wired article and followed up by a detailed discussion on the TWIML AI podcast, in which founder Kamran Khan explains how he came up with the idea of starting a company that would monitor disease situations around the world, and how this idea eventually led to his current company, BlueDot, and the red flag it raised on Dec 31, 2019, beating all other experts, authorities and the media by a number of days, if not weeks.
While the story is interesting in itself, we look at it from a slightly different angle. Without going too deep into the technical details, we attempt to break down the ingredients of this machine learning application, looking along the way for broader, more general takeaways about what a successful AI model needs.
So how did they do it? What were they looking for? Let’s see.
First, an NLP Problem
As Kamran Khan explained, there are three basic elements to BlueDot’s mechanics, the first of which is surveillance. The aim was to develop a system that monitors the world’s information on outbreaks and on any event that could be interpreted as a threat that might lead to one.
To do this, they put together an application that automatically scans professional websites, forums and other information sources available online in search of indications of an outbreak or a threat.
The designers decided to steer clear of social media because there is too much noise there. Even so, the body of text the application ploughs through each and every day runs to 100K+ pages, and it does so in about 60 languages.
Well, of course, for machine learning applications to work well, you need plenty of data, right? Still, the complexity and quantity of this data, and the complexity of the resulting natural language processing problem, are incredible.
“I never knew that a heavy metal band Anthrax existed before we started to work with this NLP-application,” Khan said jokingly, and in the age of computer viruses we can only assume that a thrash metal band from the 80s was not their biggest challenge to overcome (on a related note, how can someone possibly not know about Anthrax?).
In BlueDot’s practice, this almighty machine learning algorithm picks about five content elements out of the 100,000 every day and flags them to the company’s experts for further analysis.
Which brings us to our first important takeaway when it comes to successful machine learning applications: however impressive the work done by the machine, BlueDot’s application is not a hundred percent automated. Rather, it follows a hybrid, augmented decision-making pattern in which the algorithm prepares everything it can and then allows actual people (seasoned professionals in the case of BlueDot) to step in and make the call. In other words, it does the heavy lifting (picking the five relevant content elements out of roughly a hundred thousand), and then politely steps aside to let the experts take it from there.
To illustrate the actual process: in COVID’s case, the algorithm flagged reports of a couple of patients in Wuhan with pneumonia of unknown origin. The algorithm also knew that there was a live animal market nearby. For the experts, this was sufficient information to know that if this was not an outbreak already, these were the perfect ingredients for one in the future. It turned out that they were absolutely right.
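The triage pattern described in this section (the machine narrows a huge pile of text down to a handful of items, then people decide) can be sketched in a few lines of Python. This is only a toy illustration, not BlueDot's method: the keyword lists, weights, threshold and the `Document` type are all assumptions made up for the example, and a real system would rely on trained multilingual NLP models rather than substring matching.

```python
from dataclasses import dataclass

# Hypothetical signal terms and weights, chosen only for illustration.
SIGNAL_TERMS = {
    "pneumonia": 2.0,
    "outbreak": 2.0,
    "unknown origin": 3.0,
    "cluster": 1.0,
    "anthrax": 2.0,
}
# Hypothetical context terms that hint at a false positive,
# such as an article about the band Anthrax rather than the disease.
NOISE_TERMS = {"album": -3.0, "tour": -2.0, "band": -3.0, "concert": -3.0}


@dataclass
class Document:
    doc_id: str
    text: str


def relevance_score(doc: Document) -> float:
    """Sum the weights of all signal and noise terms found in the text."""
    text = doc.text.lower()
    score = 0.0
    for term, weight in {**SIGNAL_TERMS, **NOISE_TERMS}.items():
        if term in text:  # naive substring match, good enough for a sketch
            score += weight
    return score


def flag_for_review(docs, top_k=5, min_score=1.0):
    """Return at most top_k documents for human experts, best score first."""
    scored = [(relevance_score(d), d) for d in docs]
    candidates = [(s, d) for s, d in scored if s >= min_score]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in candidates[:top_k]]
```

The point of the design is that `flag_for_review` never makes the final call. It only shortens the reading list, and the decision about whether a flagged item is a real threat stays with the experts.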
Second, a prediction problem
Say, we have an outbreak spotted or at least flagged. What next?
In BlueDot’s procedure, the second stage is to build models in order to predict the spread of the virus.
Wait, but how?
It turns out that BlueDot has two separate and vast data sources: they have access to (anonymized) airline ticket sales data, and they can also evaluate (likewise anonymized and aggregated) mobile phone location data.
In a 21st century society this information proves to be sufficient for modeling, and thus predicting, the movement of people, at least when it comes to the most significant routes and the largest crowds.
This is especially so because they analyze airline data in two ways: first, the routes of individual aircraft (e.g. the Boeing that takes off from Tokyo and lands in Beijing, then proceeds to Seoul, then to Manila), and second, the (again, anonymized) passenger data (e.g. the guy who flies from Beijing to LA via Dubai).
This data is then merged with other relevant information, such as the weather, population and geography of the locations in question, to calculate the likelihood of where the virus will set foot next.
The model proved successful: BlueDot effectively predicted the first few cities where the outbreak would be significant. Then, as the virus spreads, the model gets more and more complex, eventually resembling a weather forecast. Even at a really advanced stage of the outbreak, the experts can get a relatively accurate idea of what will happen the next day, while the accuracy of the prediction decreases sharply the further ahead in time it looks.
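As a rough illustration of how travel volumes and local conditions might be combined into a ranking of likely next destinations, here is a minimal Python sketch. It is not BlueDot's model: the function names, the multiplicative `local_factors` adjustment and the passenger counts are all assumptions invented for the example. It simply treats each destination's share of (weighted) outbound passengers as a proxy for relative import risk.

```python
def import_risk(passengers_by_destination, local_factors=None):
    """Return each destination's share of the weighted outbound passenger total.

    passengers_by_destination: dict of city -> passenger count leaving the
        outbreak origin (illustrative numbers, not real data).
    local_factors: optional dict of city -> multiplier standing in for things
        like weather or population density (a hypothetical simplification).
    """
    local_factors = local_factors or {}
    weighted = {
        city: volume * local_factors.get(city, 1.0)
        for city, volume in passengers_by_destination.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {city: 0.0 for city in weighted}
    return {city: value / total for city, value in weighted.items()}


def rank_destinations(risk):
    """Sort (city, risk) pairs from highest to lowest relative risk."""
    return sorted(risk.items(), key=lambda item: item[1], reverse=True)


# Made-up monthly passenger counts from a hypothetical origin city.
outbound = {"Bangkok": 3000, "Tokyo": 2000, "Seoul": 1000}
for city, share in rank_destinations(import_risk(outbound)):
    print(f"{city}: {share:.2f}")
```

A real forecast would also need case counts at the origin, onward connections and time dynamics, which is part of why the picture grows more complex as an outbreak spreads.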
Then, a professional problem
The final stage of BlueDot’s model has to do with what happens when an outbreak is expected in a specific city in the coming days. The company’s professionals found that timely information, directed especially at public healthcare institutions, healthcare workers and the relevant authorities, is the best way to prepare for a virus, so they developed a warning system that gives a heads-up to those who need it the most.
Great choice! But since it has nothing to do with machine learning at this point, we’ll just leave that to the pros.
So what, then, are the lessons learned from BlueDot’s model that can, or at least should, be implemented in virtually any application of machine learning?
- The system is not running completely on autopilot. While the application does a lot when it crawls thousands and thousands of pages of content on end (and does it in multiple languages), some of the key decisions are made by human beings, who use the machine-processed data to augment their decision-making.
- It is a hybrid system. While the most spectacular part of the application has to do with machine learning, it is not just ML, and not all ML. The models built to predict the spread involve a lot of statistical modeling, and the final stage of the process is nothing more (or nothing less, I should say) than the work of professional public healthcare experts.
- The hybrid system requires a hybrid team. While processing the vast amount of information from healthcare-related forums and airline ticket sales requires a substantial IT background, that is just one side of the equation. Physicians, public healthcare experts, statisticians and designers were all involved in developing the project. Such a hybrid team is not only great fun. It is also the only way to develop a solution to such a complex problem, using - among many others - machine learning techniques as well.