Perfecting Data Visualization with Plotly Using Open-Source Data
Humans rely on audio and visual stimuli to navigate the surrounding world. An entire industry exists to capitalize on these senses, and it successfully convinces millions of people to purchase products and services every day. For most of the professional world, sight is the leading sense that drives value. For engineers, visual prowess can be demonstrated through graphs and figures, computer-aided design, and artful manufacturing. One of the easiest ways to impress employers, colleagues, or clients is to maintain a high visual standard when presenting work.
Below is a simple example of a visually pleasing and thought-provoking plot. A quality figure should have complementary colors, legible fonts, and descriptive labels and titles, and it should tell a story. This is no masterpiece of a figure, but it is close to the bare minimum expected when presenting at a professional level. Arguments could be made as to whether the font is large enough, whether the colors are right, or whether the data is significant enough, but it serves as a baseline for expectations.
Notice the downward trend in total water consumption alongside the upward trend in total population. This indicates that water use per person in New York City has decreased consistently since roughly 1980 [data publicly available at: opendataNYC].
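A figure of this kind, with two quantities on different scales sharing one time axis, can be sketched in Plotly roughly as follows. This is only a minimal sketch and not necessarily how the original figure was made: the function names, the units (million gallons per day), and the three data rows are placeholders invented for illustration, not records from the dataset.

```python
def per_capita_gallons(consumption_mgd, population):
    """Gallons per person per day, given total use in million gallons per day."""
    return consumption_mgd * 1_000_000 / population


def build_figure(years, consumption_mgd, population):
    """Two-axis line plot: consumption on the left axis, population on the right."""
    # Imported here so the helper above can be used without Plotly installed.
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.add_trace(
        go.Scatter(x=years, y=consumption_mgd, name="Total consumption",
                   mode="lines+markers"),
        secondary_y=False,
    )
    fig.add_trace(
        go.Scatter(x=years, y=population, name="Population",
                   mode="lines+markers"),
        secondary_y=True,
    )
    # Descriptive title, axis labels, and a readable font size.
    fig.update_layout(title_text="Water consumption and population",
                      font=dict(size=16))
    fig.update_xaxes(title_text="Year")
    fig.update_yaxes(title_text="Consumption (million gallons per day)",
                     secondary_y=False)
    fig.update_yaxes(title_text="Population", secondary_y=True)
    return fig


# Placeholder values for illustration only, NOT real records.
years = [2000, 2001, 2002]
consumption_mgd = [100.0, 95.0, 90.0]
population = [1_000_000, 1_010_000, 1_020_000]

per_capita = [per_capita_gallons(c, p)
              for c, p in zip(consumption_mgd, population)]

# Uncomment to open the interactive plot:
# build_figure(years, consumption_mgd, population).show()
```

Dividing total consumption by population makes the per-person decline explicit instead of leaving the reader to infer it from two diverging lines.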
The data presented above comes from a publicly available governmental open-data portal called OpenDataNYC. Another great resource is the National Oceanic and Atmospheric Administration (NOAA) website, where there are gigabytes and even terabytes of data waiting to be processed and plotted. For more general science-related datasets, Nature published a great article that references several useful data repositories [see here]. Because there are countless databases scattered throughout the web, it is wise to familiarize oneself with parsing and processing data in various formats.
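As a small illustration of handling more than one format, the sketch below reads the same kind of record from CSV and from JSON using only the Python standard library. The column names (`time`, `temp_c`) and the sample text are hypothetical and do not come from any of the datasets mentioned here.

```python
import csv
import io
import json


def records_from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of objects")
    return data


def to_float(value):
    """Convert a raw field to float; blanks and junk become None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical sample inputs; the second CSV row has a missing reading.
csv_text = "time,temp_c\n2014-01-01T00:00,11.5\n2014-01-01T00:05,\n"
json_text = '[{"time": "2014-01-01T00:00", "temp_c": 11.5}]'

csv_rows = records_from_csv(csv_text)
json_rows = records_from_json(json_text)

# CSV fields arrive as strings and JSON numbers as floats,
# so both go through the same conversion step.
csv_temps = [to_float(row["temp_c"]) for row in csv_rows]
json_temps = [to_float(row["temp_c"]) for row in json_rows]
```

Normalizing every source into a list of dictionaries with explicit missing values keeps the later averaging and plotting code independent of where the data came from.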
Three-dimensional representation of weekly death statistics published by the CDC for 122 cities (most with a population over 100,000). It is interesting to note both weekly and yearly trends, indicating there may be periods throughout the year when deaths are more common. This data has not been corrected for population increases or decreases. [data publicly available at: data.cdc.gov].
In atmospheric science, researchers use a technique called remote sensing to record and analyze the climatological and meteorological trends and cycles of the earth. Remote sensing employs autonomous data collection over long periods of time, too long to be supervised by humans, and permits intermittent retrieval and analysis. Satellites, anemometers, LiDAR devices [see here], microwave radiometers [here], and passive infrared gas analyzers [here] are just a few types of sensors that atmospheric scientists use when studying the atmosphere.
Numerous resources are available to scientists, especially in the U.S., that encourage the study of weather and climate-related events. I will be using historic wind data from an instrument located on the eastern tip of San Francisco, CA, which is openly accessible to the public [see here]. The instrument recorded wind velocity, wind direction, ambient temperature, and horizontal solar radiation, although I will only be using the first three.
Plot of three variables taken from a weather station in San Francisco for the entire year of 2014: wind direction, wind velocity, and air temperature. The year-long data was averaged for each hour of a 24-hour day, which gives the characteristic diurnal profile shown above. The year-long averaging is not necessarily indicative of the true behavior of the atmosphere, because seasons likely cause variation in wind behavior; however, the overall trends should remain [data publicly available at: data.sfgov.org].
For a dataset as extensive as the one used here, the possibilities are endless. There are 35 weather stations with roughly 6-7 years' worth of data sampled at either 5- or 15-minute intervals. This leaves tens of millions of data points to be analyzed. There is ample opportunity for correlation between stations and even comparisons between weekly, monthly, or yearly trends. I only cover one station in 2014 and its seasonal and year-long variations; however, the data is available if a more in-depth investigation is desired.
Plotted above is the diurnal, hourly averaged air temperature, wind direction, and wind velocity for the Pier 40 San Francisco site located at the northeastern tip of the city. One can observe a diurnal profile typical for those three variables. Note that this is a year-long average, so seasonal behavior may be muted, which is why the three plots below show the seasonal (monthly) variability of the three variables.
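The hourly averaging behind a diurnal profile like this one can be sketched as follows, using only the standard library. The sample tuples are placeholders, and the function names are hypothetical. The circular mean used for wind direction is an assumption on my part, not necessarily what was done for the plot: it is included because a plain arithmetic mean of 350° and 10° gives 180° (south), while both readings point roughly north.

```python
import math
from collections import defaultdict
from datetime import datetime


def circular_mean_deg(angles_deg):
    """Mean of angles in degrees, computed on the unit circle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def diurnal_profile(samples):
    """Average samples by hour of day.

    samples: iterable of (timestamp, wind_speed, wind_dir_deg, temperature).
    Returns {hour: {"speed": ..., "direction": ..., "temp": ...}}.
    """
    by_hour = defaultdict(list)
    for ts, speed, direction, temp in samples:
        by_hour[ts.hour].append((speed, direction, temp))

    profile = {}
    for hour, rows in sorted(by_hour.items()):
        speeds, directions, temps = zip(*rows)
        profile[hour] = {
            "speed": sum(speeds) / len(speeds),
            "direction": circular_mean_deg(directions),
            "temp": sum(temps) / len(temps),
        }
    return profile


# Placeholder readings for illustration only, not real station data.
samples = [
    (datetime(2014, 1, 1, 6, 0), 2.0, 350.0, 10.0),
    (datetime(2014, 1, 2, 6, 5), 4.0, 10.0, 12.0),
    (datetime(2014, 1, 1, 14, 0), 6.0, 225.0, 16.0),
]
profile = diurnal_profile(samples)
```

Each key of `profile` is an hour of the day, so the three values per hour can be passed directly to a plotting library as the y-values of three traces.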
This is a seasonal variability subplot that shows the diurnal variability by season (month). Note the rigidity of the temperature data: its shape is consistent regardless of season (apart from amplitude, of course: winter is colder, summer is warmer). Top Left: Upon inspection of the wind direction we see more erratic behavior. This is due to the nature of wind direction and its dependence on the topographical and anthropogenic structures that surround the area. However, there is a clear year-round diurnal pattern of a southern wind in the morning and a southwestern wind at midday. This agrees with averaged data taken from a nearby sensor. Top Right: The wind velocity and temperature behave similarly because of the influence of the sun on advection, conduction, and convection at the surface of the earth. It is, however, interesting to note the stillness of the wind velocity during the winter months. The data even suggest that the wind is constant during the winter months, or perhaps more aggressive during the morning (contrary to physical intuition). Bottom: A typical diurnal average for a year, which is effectively a staple of meteorological data analysis. A peak is seen midday when the earth is heated by the sun, and a local minimum can be observed in the late morning after the surface has cooled. It is also interesting that San Francisco's coolest months were December and February. [data publicly available at: data.sfgov.org].
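A seasonal breakdown of this kind comes down to grouping by month as well as by hour. The sketch below shows one way to do that for a single variable. The function name and the sample values are hypothetical, and a direction variable would need a circular mean in place of the plain mean used here.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_diurnal(samples):
    """Average one variable by (month, hour of day).

    samples: iterable of (timestamp, value).
    Returns {month: {hour: mean_value}}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.month, ts.hour)].append(value)

    result = defaultdict(dict)
    for (month, hour), values in sorted(buckets.items()):
        result[month][hour] = mean(values)
    return dict(result)


# Placeholder temperatures for illustration only, not real station data.
samples = [
    (datetime(2014, 1, 1, 6, 0), 10.0),
    (datetime(2014, 1, 2, 6, 0), 12.0),
    (datetime(2014, 7, 1, 6, 0), 15.0),
    (datetime(2014, 7, 1, 14, 0), 21.0),
]
by_month = monthly_diurnal(samples)
```

Each month's inner dictionary can then be drawn as its own trace, giving one diurnal curve per month on a shared axis.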
Overall, my goal here was to inform other scientists, engineers, and data miners that there are wells of information available online and open to the public. Each of the three datasets used above was a real-world example with potential research-grade statistical and physical significance. With the appropriate cultivation and experience, there are ample opportunities to publish meaningful results using data available to the public. I hope this coverage of open data and Plotly was beneficial and encourages involvement in the open-source community.