Summary
SYNSYS uses search trend data from public domain sources and analyzes it to predict viral outbreaks before they happen.
Description
Description
SYNSYS is a syndromic surveillance system designed for the public and private healthcare sectors. The system uses exclusively public domain data from search engines. These data sets can be analyzed and used to predict outbreaks before they happen. This system can utilize machine learning at the elementary level to identify when search trends are abnormal based on past precedent, thereby flagging it for a potential epidemic or pandemic risk. This can be used to anticipate all-time highs once an epidemic or pandemic is already underway, like in the COVID-19 pandemic. SYNSYS originally won the NYAS Tracking Coronavirus Innovation Challenge in 2020, but has since been tweaked and upgraded to adapt to new data and developments.
The first part of the system, which uses search engine trends, works on the hypothesis that most people are more likely to self-medicate if they fall ill, and thus will want to search for their symptoms online to know what over-the-counter medication they should take. This is supported by an article published in Epidemiology and Infection that shared the results of a survey conducted in Beijing, showing that 69% of people prefer to self-medicate when ill. Any abnormalities or variances detected in real-time by the system will be flagged for analysis and scrutiny. This system will be able to detect an outbreak weeks before official diagnoses.
To test this hypothesis, I conducted a preliminary analysis using 2 COVID-19-related datasets: a 1-year long data set, covering a timeline between 6th Jan 2019 to 26th April 2020, and a 4-year long data set, covering a timeline between 1st Jan 2016 to 1st May 2020. The reason behind ending these timelines in mid-2020 is the widespread cognizance of the symptoms of COVID-19. The assumption upon which this operates is that the outbreak is undetected and unnamed. COVID-19 was only named a pandemic by the WHO in mid-March 2020; therefore, the timelines end shortly after. The purpose of the 4-year timeline is to establish how the program can differentiate between seasonal trends and anomalies. I used Google search trends for this demonstration as it is the most accessible and widely used in the United States, the sample country.
Here is the first composite graph of the Google Search Trends for 4 of COVID-19's most important symptoms/keywords: fever, cough, flu, and pneumonia:
As one can see through the annotations on this graph, there's a notably abnormal peak during the week of 22-29 December for 3/4 terms. The 4th term, flu, peaks abnormally shortly after. It is suspected that COVID-19 was present in the United States of America weeks before the first officially reported case in January 2020. The findings of this data analysis support that hypothesis. The largest peak on this graph belonged to the week when the WHO declared COVID-19 to be a pandemic.
However, to provide some more context here is the second composite graph of the same terms:
Now, this graph puts all peaks into context. There are some obvious standouts.
- There's an abnormally high peak for "pneumonia" in 2016 -- this actually corresponds to reports of presidential candidate Hillary Clinton receiving a pneumonia treatment.
- There's an abnormally high peak for "flu" in 2018 -- this corresponds to reports of America's worst flu season since the Swine Flu.
- There's a high peak for all terms in March 2020 -- this is when the WHO declared COVID-19 to be a pandemic, something visible in the previous graph as well.
The peak before the March 2020 peak -- which is approximately in late December 2019 -- lines up with the suspected COVID-related peaks noticed in the previous graph. However, the reasons a program would notice this as an anomaly instead of a pattern are as follows:
- Normally, flu season-related peaks take place in January-February, not December. This has been the case for the last 4 years. A system would automatically flag this because of its irregularity.
- Notice the individual peaks themselves. The peak for "pneumonia" is the second-highest it has even been in the last 4 years. This indicates an abnormal surge in the frequency at which people are searching for "pneumonia." In its early days, COVID-19 was often synonymous with pneumonia, to the extent that Chinese authorities determined the novel coronavirus through a patient with pneumonia.
- Accompanying this is the peak of "cough" -- a common COVID-19 symptom as well. Coughs are very deceiving, as most people would treat them through OTC medication. However, this is one symptom that untreated can develop into more severe complications. It makes sense that cough accompanies pneumonia during this peak, as the two do go hand-in-hand. This peak also the highest in the last 4 years. Together, all of this would point to an anomaly, as usually cough peaks higher than pneumonia during regular flu seasons. This time, it's lower.
All of this was modeled after the COVID-19 pandemic as we have the best access to such data only after 1st January 2016. However, a system that conducts such tracking for all common symptoms and illnesses would spot abnormalities that can differ from the novel coronavirus, thereby signaling any unprecedented or mysterious correlations.
Finally, to investigate how the system would work after a pandemic/epidemic is already underway, what came to mind was the frequency of searches for "COVID testing." Just a couple of weeks into a pandemic, most people are already aware of the symptoms they need to be wary of. In addition, since the disease is named, they would simply search for it. However, none of this points to possible COVID caseloads. "COVID testing," on the other hand, is of importance to people searching for the location or contact details of one, either because they're suspicious of their own symptoms or somebody else's. Based on this behavioral assumption, I conducted the same trend search for this key phrase from the beginning of the official pandemic (3rd March 2020) to 1st January 2021. Here's the graph:
In isolation, this doesn't tell us much except that even this key phrase had cyclic movement. However, superimposed upon the United States' daily cases graph since March paints a clearer picture of its relevance (lined up as accurately as possible, neon green signifying the previous trend line):
The graphs have been calibrated as accurately as possible, using the May 17 and Jan 1 dates to align the search trends trend line. The United States' Daily COVID cases graph has been obtained from the CDC's website.
Upon studying this, I found that the search trend peak occurred about 2 weeks before a peak in daily cases. At the very beginning of the pandemic, search trends for "COVID testing" began peaking before the first all-time high peak for the US during that period. Similarly, search trends peaked about 2 weeks before a new all-time high in July. The US has seen very high cases since November 2020; this is reflected in the fact that there has been no real downward trend for search trends since its peak in November 2020. A correlation like this would be sufficient for SYNSYS to flag potential danger periods and predict when all-time high caseloads are expected to happen.
However, this system alone, with just this data, isn't enough. More complex developments would include bringing in data from hospitals and pharmaceutical companies to see how hospital bed occupancy and OTC medication sales compare to Google search trends. Building a steady correlation would make the system smarter and more sensitive to anomalies that indicate outbreaks. This system can be made more sensitive when deployed by state governments, as the data can be localized to states and cities to keep a close track of developments.
For further reference, here's a research article from the Clinical Gastroenterology and Hepatology Journal that shows how increased internet search interest for GI symptoms may predict COVID-19 cases in US hotspots.
Taking this project one step further, here's a mockup of the software dashboard to demonstrate how the data will be presented:
The system would effectively redraw and map trend graphs based on raw search engine data and process it to analyze relative figures like "Abnormality" and "Probability of Outbreak" and display the trend line as it is. Taking the system one step further would be the ability to also comb through social media mentions for popular terms related to illnesses. This would further refine SYNSYS' precision.
All in all, one of the biggest pros of this system is that it's completely digital and reliant on public domain data. This means that neither does it rely on any physical production nor requires people to give up personal data and privacy. The core of this system is behavioral patterns and analysis. By relying on public domain data, it is implementable everywhere. It is private, scalable, non-intrusive, adaptable, cost-effective compared to physical solutions, and sustainable.
Who will take these actions?
A project like this would require high-stakes stakeholders. To be truly effective, it would need to be used by a country's government or by an agency/company hired by the government. Only then would the outcomes of this detection be put to actual preventive use. Additionally, once in use by the federal government, it can be made further sensitive by localizing it to states and cities to track suspected hotspots and the spread to other states/counties.
Additionally, support from pharmaceutical companies, medical research universities, and large hospitals all over the country would be important for the further development of this system. While the base of SYNSYS operates on public domain data, to make it truly sensitive would require this support.
What are the projected costs?
Monetary Costs:
- Hiring a dev team experienced in ML to develop SYNSYS for a 6mo-1yr period (7-10 people) (with compensation equivalent to a well-paying full-time ML job)
- Cloud data-hosting and application-hosting services
Thankfully, in the basic version of SYNSYS, privacy is maintained completely as SYNSYS only uses general data from Google search trends. Information volunteered by pharmaceutical companies or hospitals would not violate citizens' privacy either, as the system is not interested in individual pieces of data.
Timeline
- Development of SYNSYS -- 6 months to 1 year. This includes hiring developers, developing the system, alpha and beta testing, UI designing and development, and ML testing.
- Adoption by 1 government -- 1 - 2 years. Depending on the willingness of a government to adopt such a system. Governments interested in building their pandemic detection and response teams would benefit the most. This also takes into account negotiations and tweaking the system to be most accurate for the country (different search engines). This is an estimate that takes into account general bureaucracy.
- Deployment -- 2-3 months. As soon as SYNSYS is adopted by a government, it can be put to use immediately. Within 1-2 months, it would have attuned to the country's search trend's sensitivity and nuances and have adjusted its expectations for abnormalities accordingly. As mentioned previously, if put in use by state governments for localized data, it would take a month extra to be tuned accordingly.
- Additional data sources -- 1-2 years. This would take place at the same time as the adoption timeline. To make SYNSYS truly sensitive, having information specific to a country's healthcare system, providers, and hospitals would help SYNSYS adapt better and faster to the country's landscape.
About the author(s)
Esha Datanwala is a 21-year old recent graduate from Ashoka University, India. She is a member of the New York Academy of Sciences and has previously won 2 challenges organized by the Academy. She is currently based in Mumbai, India.
Juliano Queral Jan 7, 2021 08:18
Member
| Hi Esha! I’m a member of the NYAS too! Member since 2016 and winner of the PepsiCo micronutrient challenge. Really liked your project! |