Detecting the emergence of the next disease X

Science at work 20 July 2020
The emergence of Covid-19 in late December 2019 was detected online by certain surveillance systems. However, the weak signals were buried beneath a mountain of data, and were not interpreted in time. In an article published in Transboundary and Emerging Diseases on 20 July 2020, a team of researchers from CIRAD looks back at the vocabulary used online to describe the new disease. Their research, conducted as part of the EU MOOD project, will serve to improve the systems set up to detect the next disease X.
More than 60% of new infectious diseases come from animals © P-Y. Le Gal, CIRAD

On 31 December 2019, health officials in Wuhan, China, reported a cluster of 27 cases of a "pneumonia of unknown cause". The same day, PADI-Web and HealthMap identified several online articles referring to a "mystery disease". ProMed, for its part, had detected the same type of vocabulary in the online media the day before the official notification from China.

These three artificial intelligence systems are classed as "EBS" (Event-Based Surveillance). Each day, they review hundreds of thousands of online articles to monitor the emergence or spread of certain diseases. CIRAD's Mathieu Roche, one of the authors of the article in Transboundary and Emerging Diseases, compares his work as a data miner to that of a gold prospector: "We rummage through vast amounts of data, so we need to be able to sort useful information from useless information efficiently. Our surveillance systems act as a sieve to sort the flakes of gold from the grit. Our aim is to find the nuggets, which in our case are the weak signals of the emergence of a disease".

A vocabulary centring on "mystery" and "pneumonia"

Certain surveillance systems focus on existing diseases such as Ebola or African swine fever. However, for new diseases, researchers use "syndromic" RSS feeds. "The aim is no longer to target a specific disease", Mathieu Roche explains. "We look more for keywords relating to symptoms, mystery phenomena or signs of concern."

For Covid-19, the multidisciplinary work done by Mathieu Roche and Renaud Lancelot's teams pinpointed a vocabulary centring on "mystery" and "pneumonia" as the disease emerged. "Prior to formal identification of the virus, we find articles about a 'mystery disease' or 'pneumonia of unknown cause'", Mathieu Roche adds. "Subsequently, once the medical profession gets involved, more technical terms are used."

"Knowing more about the vocabulary used depending on the stage of disease evolution would enable us to improve our surveillance systems", Mathieu Roche says. "The more we are able to pinpoint a specific vocabulary, the more precise identification will be. It's as if we were using a finer sieve."
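
To make the "finer sieve" idea concrete, here is a minimal Python sketch of stage-dependent keyword screening. It is an illustration only, not the method used by PADI-Web, HealthMap or ProMed: the term lists and the names match_terms and classify_headline are assumptions made up for this example, and a real system would rely on far richer text mining.

```python
import re

# Hypothetical term lists, for illustration only; they are NOT the
# vocabularies used by the surveillance systems described in the article.
EARLY_TERMS = ["mystery disease", "mystery illness", "pneumonia of unknown cause"]
TECHNICAL_TERMS = ["coronavirus", "sars-cov-2", "covid-19"]


def match_terms(text, terms):
    """Return the terms that occur in text as whole phrases, ignoring case."""
    found = []
    for term in terms:
        # Lookarounds stop a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def classify_headline(headline):
    """Label a headline by the stage its vocabulary suggests."""
    if match_terms(headline, TECHNICAL_TERMS):
        return "technical"      # vocabulary seen once the pathogen is identified
    if match_terms(headline, EARLY_TERMS):
        return "early_signal"   # vague, symptom-based vocabulary
    return "no_signal"


if __name__ == "__main__":
    for headline in [
        "Mystery disease reported in Wuhan",
        "Scientists identify novel coronavirus",
        "Local football results",
    ]:
        print(classify_headline(headline), "|", headline)
```

Narrowing or extending the early-stage list changes how fine the sieve is: a shorter, more specific list lets fewer irrelevant articles through, at the risk of missing a weak signal worded differently.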

The researchers hope that their retrospective analysis will serve to build even more effective surveillance systems in future.

This research was conducted as part of the EU MOOD project, which set out to harmonize health surveillance in Europe. MOOD uses "model diseases" (see box below), classed according to how they are transmitted. Covid-19 currently serves as a model for the surveillance of as yet unknown diseases, known as disease X.

Diseases monitored by the EU MOOD project

• Unknown pathogens (disease X), which are a challenge for any epidemic surveillance system;
• Influenza (all types of the virus) for airborne pathogens;
• Tick-borne encephalitis and Lyme disease as model endemic pathogens transmitted by endemic vectors;
• West Nile virus and Usutu virus as examples of exotic pathogens transmitted by endemic vectors;
• The chikungunya, dengue and Zika viruses as model exotic pathogens transmitted by invasive mosquito species;
• Tularaemia and leptospirosis as model neglected endemic pathogens with multiple modes of transmission and reservoirs;
• Antibiotic-resistant bacterial strains as examples of the threats posed by complex and anthropogenic diseases.


Sarah Valentin, Alizé Mercier, Renaud Lancelot, Mathieu Roche, Elena Arsevska. Monitoring online media reports for the early detection of unknown diseases: insights from a retrospective study of COVID-19 emergence. Transboundary and Emerging Diseases, 2020.