Maker Portal

Visualizing COVID-19 Data in Python

Data on COVID-19 cases is scattered across the web, along with a wide range of visualizations and figures. This blog post aims to create meaningful visualizations that may not be available elsewhere, while showing how to source, analyze, and visualize COVID-19 case and rate data using Python. All of the data used here is publicly available, and code and links are provided where necessary for anyone interested in replicating the figures. The methods were conceived and developed by Maker Portal and do not reflect the preferred methods of the government or of any private entity. Several Python libraries are used below, and it is recommended that users install them and verify that they work before attempting to replicate the figures. The visualizations were computed at a fixed point in time, so each carries a timestamp of the date when the data was retrieved. There is also a GitHub repository of code and high-resolution figures at https://github.com/makerportal/COVID19-visualizations.


Python’s “requests” library is used to download data from GitHub through its raw user content URL format. With requests, the contents of a .csv file can be retrieved using a simple 'get' request.
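
A minimal sketch of this retrieval step is shown here, under a few assumptions: the base URL follows GitHub's raw.githubusercontent.com pattern for the nychealth/coronavirus-data repository on its 'master' branch, the 'by-sex.csv' entry and the helper names 'fetch_csv_text' and 'parse_csv_text' are illustrative, and the repository layout may have changed since the data was retrieved.

```python
import csv
import io

# Assumed base URL, built from GitHub's raw-content pattern and the
# repository named in this post (branch name 'master' is an assumption)
github_url = "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/"

# 'boro.csv' (index 0) and 'by-age.csv' (index 1) are named in the text;
# 'by-sex.csv' is an assumed third entry
data_file_urls = ["boro.csv", "by-age.csv", "by-sex.csv"]


def fetch_csv_text(data_file_url, timeout=10):
    """Download one .csv file and return its contents as a string."""
    # Third-party dependency, imported here so that parse_csv_text()
    # can still be used offline on text obtained some other way
    import requests

    response = requests.get(github_url + data_file_url, timeout=timeout)
    response.raise_for_status()  # fail loudly on a 404 or similar
    return response.text


def parse_csv_text(txt):
    """Split raw .csv text into a header row and a list of data rows."""
    rows = [row for row in csv.reader(io.StringIO(txt)) if row]  # skip blank lines
    header, data = rows[0], rows[1:]
    return header, data
```

Calling 'fetch_csv_text(data_file_urls[0])' would return the 'txt' string for 'boro.csv', and 'parse_csv_text(txt)' splits it into 'header' and the remaining data rows.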

Looking at the variable 'txt', we can see that it holds all the data from New York City’s Coronavirus (COVID-19) data repository for each of its five boroughs, hosted publicly on GitHub. The 'header' variable, if printed, shows the column headers of the selected .csv file. If we changed the index of the 'data_file_url' variable to 1, we would retrieve the 'by-age.csv' data instead of the 'boro.csv' data, and so on. From here on, the work consists of parsing and plotting the data. The format given above is the foundation for acquiring the data, and the plotting and programming become increasingly complex in the sections that follow.

The data used here is acquired from the NYC Health Department, with no adaptations whatsoever:

https://www1.nyc.gov/site/doh/covid/covid-19-data.page

For a breakdown of the data and variables, see the NYC COVID-19 Github repository:

https://github.com/nychealth/coronavirus-data

The first set of data analyzed involves cases, hospitalizations, and deaths by age and sex.


This section visualizes COVID-19 rates of confirmed cases, hospitalizations, and deaths by age group and by sex. Bar charts in Python help us compare each of the rates across sexes and age groups. The age group visualization is given below:

The code to replicate the age rate bar chart is also given below:
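
As a hedged sketch rather than the original listing, a grouped horizontal bar chart of this kind could be built with matplotlib as follows. The function name 'plot_grouped_barh', its arguments, and the layout choices (figure size, bar spacing) are assumptions made for illustration; the group labels and rate values would come from the parsed 'by-age.csv' rows.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_grouped_barh(groups, series, xlabel="Rate"):
    """Draw clustered horizontal bars: one cluster per group, one bar per series.

    groups : list of category labels (e.g. age groups)
    series : dict mapping a series label (e.g. a rate name) to a list of
             values, one value per group
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    bar_height = 0.8 / len(series)  # the bars of one cluster share 80% of its slot
    y = np.arange(len(groups))

    for i, (label, values) in enumerate(series.items()):
        # offset each series so its bars sit side by side within the cluster
        ax.barh(y + i * bar_height, values, height=bar_height, label=label)

    # center the tick labels on each cluster of bars
    ax.set_yticks(y + bar_height * (len(series) - 1) / 2)
    ax.set_yticklabels(groups)
    ax.set_xlabel(xlabel)
    ax.legend()
    return fig, ax
```

Swapping 'ax.barh' for 'ax.bar' (and the y-axis calls for their x-axis equivalents) gives the vertical variant of the same chart.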

It is easy to see that there is a sharp dependence on age with respect to each of the case rates, hospitalization rates, and death rates.

The age group plot uses a horizontal bar chart in Python. The rates by sex can be plotted in almost exactly the same way, as shown below:

For the case of rates by sex, the vertical bar chart proved to be better for visualization.

The code to replicate this plot is also given below:

Again, the rates differ clearly by sex, with higher values for males than for females. In this data, males have higher rates of infection, hospitalization, and death. The next section shows how COVID-19 cases have evolved over time.


In the previous section, the data was partitioned by distinct factors: sex and age group. In this section, the partitioning shifts to time. This allows us to see possible trends in the data over time and to make comparisons with models or projections. The simple time series of new positive cases, new hospitalizations, and new deaths is given in the top subplot below, while the cumulative totals of cases, hospitalizations, and deaths are given in the bottom subplot:

This visualization is slightly more complicated than the previous two, as it requires some date conversions and mathematics. The code is given below, followed by a description of the methods:

In the code above, much remains the same in the first few lines, particularly with respect to retrieving and parsing the data. The first divergence occurs when we convert the date strings into Python 'datetime' objects. The datetime type handles much of the time-series work, such as formatting dates and handling years, months, and days.

Next, the new cases, hospitalizations, and deaths were plotted as raw values on the first subplot, while on the second subplot the values were summed cumulatively (hence the 'np.cumsum()' function). Lastly, the values were plotted and formatted using a logarithmic y-axis, as seen in the plot above. In the next section, we’ll move from simple 2-D variables to more complex 3-D variables, which use spatial indexing to map data to geographic coordinates.
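
The date conversion and cumulative sum described above can be sketched as follows. This is an illustration under assumptions, not the original listing: the helper name 'prepare_time_series' is hypothetical, and the date format string '%m/%d/%y' is a guess that should be checked against the actual date column in the data.

```python
from datetime import datetime

import numpy as np


def prepare_time_series(date_strings, new_counts, date_format="%m/%d/%y"):
    """Convert date strings to datetime objects and accumulate daily counts.

    date_format is an assumption; adjust it to match the .csv date column.
    Returns (dates, new_counts, totals), where totals[i] is the running
    sum of new_counts[0..i].
    """
    dates = [datetime.strptime(d, date_format) for d in date_strings]
    new_counts = np.asarray(new_counts, dtype=float)
    totals = np.cumsum(new_counts)  # cumulative values for the second subplot
    return dates, new_counts, totals
```

The returned 'dates' list can be passed directly to matplotlib's 'ax.plot(dates, new_counts)' and 'ax.plot(dates, totals)', and 'ax.set_yscale("log")' switches a subplot to a logarithmic y-axis.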


Visualizing geographic information is one of the most useful tools in epidemiology, as it can give scientists an idea of how specific areas are affected and what generalizations can be made about specific regions. Below is a simple geographic visualization of New York City, showing how each borough of NYC is affected by COVID-19:

We can see from the figure above that the Bronx and Staten Island seem to have the highest case rates of the five boroughs. Conversely, Manhattan, the borough with the highest population density, seems to have the lowest COVID-19 infection rate.

In Python, the basemap toolkit is used to produce the geographic visualization shown above. The coloring of each borough uses a colormap, defined as a range of reds, where lighter red indicates a lower infection rate and deeper red indicates a higher one. Below is the code used to replicate the figure above:
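
The colormap step on its own can be sketched as follows; the basemap drawing itself is not reproduced. The sketch assumes matplotlib's built-in 'Reds' colormap and a simple min-max normalization so that the lowest rate maps to the lightest red and the highest rate to the deepest red. The function name 'rates_to_colors' is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize


def rates_to_colors(rates, cmap_name="Reds"):
    """Map each rate to an RGBA color spanning the full colormap range.

    Min-max normalization is an assumption: the smallest rate gets the
    lightest color and the largest rate gets the deepest color.
    """
    rates = np.asarray(rates, dtype=float)
    norm = Normalize(vmin=rates.min(), vmax=rates.max())  # scale rates to 0..1
    cmap = plt.get_cmap(cmap_name)
    return cmap(norm(rates))  # array of shape (len(rates), 4)
```

Each returned row is an RGBA color that can be used as the fill color of the polygon for the corresponding borough or zip code.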

The final visualization is a zip code map of COVID-19 infection rates in New York City. This is the most advanced plot, as it requires some dynamic modification of the coloring, labeling, and placement of data. It is a 3-D visualization in the sense that it combines 2-D geographic information with COVID-19 positive testing rates (colored in blue/purple).

The code to replicate this plot is again given below:

This last geographic visualization is perhaps the most interesting, in that it gives city officials an idea of where COVID-19 is spreading and may give insight into why certain areas are affected and others are not. The code to replicate it is somewhat similar to the previous code, especially that of the borough map. However, there are some particular implementation details that make this geographic plot more complex than the others. First, we had to map the rates onto the full scale of the colormap to best show how the zip codes differ. Then, the zip code labels had to be overlaid on their respective boundaries, which was done by finding the center point of each shape and mapping the zip code labels to those points. The labels were also rotated based on the rough rotational axis of each shape. Finally, the font size of the labels had to be scaled to avoid overlap and to fit properly within the bounds of their respective zip code boundaries.
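
One possible way to compute the label position and rotation for a single zip code shape is sketched below. The specific choices are assumptions and not necessarily what was used for the figure: the center is taken as the mean of the boundary vertices, the "rough rotational axis" is estimated as the principal axis of those vertices, and the length along that axis is returned as a quantity that a font size could be scaled against. The function name 'label_placement' is hypothetical.

```python
import numpy as np


def label_placement(vertices):
    """Estimate where and how to draw a label inside a polygon.

    vertices : sequence of (x, y) boundary points of one shape
    Returns (center, angle_degrees, length_along_axis).
    """
    pts = np.asarray(vertices, dtype=float)
    # shapefile rings usually repeat the first vertex at the end; drop the
    # duplicate so it does not bias the mean
    if len(pts) > 1 and np.allclose(pts[0], pts[-1]):
        pts = pts[:-1]

    center = pts.mean(axis=0)  # assumed definition of the "center point"

    # principal axis: eigenvector of the covariance with the largest eigenvalue
    centered = pts - center
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    major = eigvecs[:, np.argmax(eigvals)]

    angle = np.degrees(np.arctan2(major[1], major[0]))
    angle = (angle + 90.0) % 180.0 - 90.0  # fold into [-90, 90) so text is not upside down

    # extent of the shape along its major axis, usable for scaling the font
    projection = centered @ major
    length = projection.max() - projection.min()
    return center, angle, length
```

The results could then be passed to matplotlib's 'ax.text(center[0], center[1], label, rotation=angle, fontsize=...)', with the font size chosen in proportion to 'length' and the number of characters in the label.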

Some data sifting was also needed, as many of the zip codes in the NYC shapefile did not have data (the white spots). It is difficult to see all of the zip codes within Manhattan due to the limited resolution of the image available on this site, which is why a full-sized image was uploaded to this project’s GitHub page, where it is available for download:

https://github.com/makerportal/COVID19-visualizations

The GitHub repository contains much of the same information as is given here, but with higher-resolution images, in case users want to use those instead of the compressed images on this site.


Countries and cities around the world produce millions of data points relating to health crises such as the COVID-19 pandemic. This is both beneficial and detrimental to the public’s understanding of the severity of diseases and outbreaks, as much of the data is misconstrued or poorly understood. This is why, in this tutorial, we aimed to use this publicly available information and create several coding routines to produce meaningful visualizations of COVID-19 data. The data was acquired directly from New York City’s Department of Health and was parsed directly into Python without any manipulation. Python has a multitude of powerful tools for analyzing and visualizing information, and several of those tools were used here. Bar charts were used to visualize less complex data and indicated that age plays a major role in COVID-19 rates in NYC; the same was concluded for sex. Next, the time-series representation of NYC infections was plotted for both new cases and cumulative totals, indicating that over the range plotted (March to early April) the slope of the totals on the logarithmic axis was plateauing. Finally, geographic maps were used to plot borough and zip code rates across New York City. This concludes the Python analysis of COVID-19 data; if there is interest, another analysis may be conducted on a larger dataset.
