Development of a COVID-19 forecasting model based on synthetic individual data

The ongoing COVID-19 pandemic has exposed the shortcomings of epidemiological modeling for guiding policy decisions. Because public data on infection spread in contact networks and on individual courses of disease are lacking, current forecasting models rely heavily on unreliable population statistics and ad hoc parameters, resulting in forecasts with high uncertainty.

In this project, we aim to develop a COVID-19 forecasting model based on individual course-of-disease data. Our hypothesis is that a model trained on the courses of disease of 100 subjects will perform better than current state-of-the-art models that use population data.

To tackle the problem of insufficient public individual data, we propose an algorithm to generate a synthetic Taiwanese COVID-19 dataset and demonstrate the use of this dataset in epidemiological forecasting. We collected COVID-19 data from Taiwanese public databases for the period when the original SARS-CoV-2 virus was most prevalent (January to October 2020). The Firefly algorithm is used to optimize the epidemiological parameters, and the synthetic dataset is validated by comparison with Taiwanese public data and clinical observations. Furthermore, we plan to construct a transformer-type deep neural network (DNN) that uses the state-of-the-art linear constraint method POLICE to forecast the spread of future variants from observations of the courses of disease of a small set of individuals. Enforcing constraints is necessary because disease spread is subject to physical limits, such as the population size.
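As an illustration of the parameter optimization step, the following is a minimal sketch of a firefly search fitting two epidemiological parameters to an observed curve. It is not the project's implementation: the toy SIR simulator, the function names (`simulate_infectious`, `firefly_minimize`, `fit_error`), the parameter bounds, the hyperparameter values, and the "observed" curve (generated here from known parameters) are all assumptions made for the example.

```python
import numpy as np


def simulate_infectious(params, days=60, population=1000.0, initial_infectious=1.0):
    """Toy discrete-time SIR model (an assumption, not the project's simulator)."""
    transmission_rate, recovery_rate = params
    susceptible = population - initial_infectious
    infectious = initial_infectious
    curve = []
    for _ in range(days):
        new_infections = min(transmission_rate * susceptible * infectious / population,
                             susceptible)
        new_recoveries = recovery_rate * infectious
        susceptible -= new_infections
        infectious += new_infections - new_recoveries
        curve.append(infectious)
    return np.array(curve)


def firefly_minimize(objective, lower, upper, n_fireflies=15, n_iter=30,
                     step_scale=0.2, attractiveness0=1.0, absorption=1.0, seed=0):
    """Basic firefly search: each firefly moves toward every brighter (lower-cost) one."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    span = upper - lower
    positions = lower + rng.random((n_fireflies, lower.size)) * span
    costs = np.array([objective(p) for p in positions])
    history = [float(costs.min())]  # best cost after each sweep
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if costs[j] < costs[i]:  # firefly j is brighter than firefly i
                    # Attractiveness decays with the (normalized) squared distance.
                    dist2 = np.sum(((positions[i] - positions[j]) / span) ** 2)
                    attraction = attractiveness0 * np.exp(-absorption * dist2)
                    noise = step_scale * (rng.random(lower.size) - 0.5) * span
                    moved = positions[i] + attraction * (positions[j] - positions[i]) + noise
                    positions[i] = np.clip(moved, lower, upper)  # stay inside the bounds
                    costs[i] = objective(positions[i])
        step_scale *= 0.95  # shrink the random walk over time
        history.append(float(costs.min()))
    best = int(np.argmin(costs))
    return positions[best].copy(), float(costs[best]), history


# Hypothetical "observed" data generated from known parameters for the demo.
observed = simulate_infectious((0.3, 0.1))


def fit_error(params):
    """Mean squared error between the simulated and the observed curve."""
    return float(np.mean((simulate_infectious(params) - observed) ** 2))


best_params, best_cost, history = firefly_minimize(
    fit_error, lower=[0.05, 0.01], upper=[1.0, 0.5])
print("estimated (transmission, recovery):", best_params, "error:", best_cost)
```

In this sketch the brightest firefly never moves, so the best cost cannot get worse from one sweep to the next; in a real fit the objective would compare the synthetic data with the public statistics rather than with a simulated curve.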

Our synthetic dataset contains each subject’s age, gender, job, and social context within household, school class, workgroup, healthcare system, and municipality, with detailed secondary daily contacts and effective contacts. It also incorporates the course of disease of each subject, i.e. infection date, latent period, infectious period, negative/positive test date, symptomatic date, critically ill date, recovered date, and day of death. Preliminary results demonstrate that our synthetic dataset can be used to train and evaluate state-of-the-art epidemiological models. We trained the epidemiological model SIKJα with a reference period of 180 days and forecasted the numbers of confirmed cases and deaths 50 days ahead. The trend predicted by SIKJα agreed with our ground-truth synthetic dataset. Our data synthesis algorithm provides a valid alternative for benchmarking epidemiological models and advances COVID-19 forecasting research. Our DNN model is expected to provide more accurate forecasts with lower uncertainty than current state-of-the-art models.
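To make the structure of such a dataset concrete, here is a minimal sketch of one subject record and of the split into a 180-day reference period and a 50-day forecast window. The class name `SubjectRecord`, its field names, the helper functions, and the two example subjects are hypothetical and chosen only for illustration; they do not reproduce the actual schema, and the contact and social-context fields are reduced to two identifiers.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SubjectRecord:
    """One synthetic subject; all field names are illustrative assumptions."""
    subject_id: int
    age: int
    gender: str
    job: str
    household_id: int
    municipality: str
    infection_date: Optional[date] = None
    latent_period_days: Optional[int] = None
    infectious_period_days: Optional[int] = None
    positive_test_date: Optional[date] = None
    symptomatic_date: Optional[date] = None
    critically_ill_date: Optional[date] = None
    recovered_date: Optional[date] = None
    death_date: Optional[date] = None


def daily_positive_counts(records, start, n_days):
    """Aggregate individual positive test dates into a daily case series."""
    counts = Counter(r.positive_test_date for r in records
                     if r.positive_test_date is not None)
    return [counts.get(start + timedelta(days=d), 0) for d in range(n_days)]


def split_reference_forecast(series, reference_days=180, horizon_days=50):
    """Split a daily series into the training window and the forecast target."""
    if len(series) < reference_days + horizon_days:
        raise ValueError("series is shorter than reference period plus horizon")
    return (series[:reference_days],
            series[reference_days:reference_days + horizon_days])


# Two made-up subjects, used only to exercise the functions.
records = [
    SubjectRecord(1, 34, "F", "teacher", 10, "Tainan",
                  infection_date=date(2020, 1, 1), positive_test_date=date(2020, 1, 3)),
    SubjectRecord(2, 58, "M", "driver", 11, "Taipei",
                  infection_date=date(2020, 7, 1), positive_test_date=date(2020, 7, 5)),
]
series = daily_positive_counts(records, start=date(2020, 1, 1), n_days=230)
reference, target = split_reference_forecast(series)
```

A forecasting model would be fitted on `reference` and scored against `target`, which is how a synthetic ground truth can serve as a benchmark.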

Keywords: COVID-19, individual data, synthetic dataset, epidemiological model, deep neural network




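The ongoing COVID-19 pandemic has exposed the shortcomings of epidemiological models in guiding policy decisions. Because public data on infection spread and individual courses of disease are lacking, current forecasting models rely heavily on unreliable population statistics and ad hoc parameters, which makes their forecasts highly uncertain.

In this study, we aim to develop a COVID-19 forecasting model based on individual data. Our hypothesis is that a model trained on the courses of disease of 100 subjects will perform better than state-of-the-art models that use population data. To address the shortage of public individual data, we propose an algorithm for synthesizing a Taiwanese COVID-19 dataset and demonstrate the use of this dataset in epidemiological modeling. We collected data from Taiwanese public databases for the period when the original SARS-CoV-2 virus was most prevalent (January to October 2020). The Firefly algorithm is used to optimize the epidemiological parameters, and the synthetic dataset is validated by comparison with Taiwanese public data and clinical observations. In addition, we plan to build a transformer-type deep neural network (DNN) using the state-of-the-art linear constraint method POLICE. Adding constraints is necessary because disease spread is subject to physical limits; for example, the number of infected cases should not exceed the population size.

Our synthetic dataset includes each subject's age, gender, job, household, school class, workgroup, clinic, and community, together with detailed daily contacts and effective contacts. It also includes each subject's course of disease, namely infection date, latent period, infectious period, negative/positive test date, symptomatic date, critically ill date, recovery date, and date of death. Preliminary results show that our synthetic dataset can be used to train and evaluate state-of-the-art epidemiological models. We trained the epidemiological model SIKJα with a reference period of 180 days and forecasted the numbers of confirmed cases and deaths 50 days ahead. The trend predicted by SIKJα is consistent with our ground-truth synthetic dataset. Our data synthesis algorithm provides a valid alternative for epidemiological modeling and advances research on COVID-19 forecasting. Compared with current state-of-the-art models, we expect our DNN model to provide more accurate forecasts with lower uncertainty.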


Principal Investigator: Ass. Prof. Torbjörn Nordling.


Members: Rain Wu, Torbjörn Nordling.