
Python FTP for Data Mining and Analysis


Python’s file transfer protocol (FTP) library, called ftplib, is a powerful tool for scraping data off the internet. For this project, I will be downloading weather data from the Automated Surface Observing System (ASOS), which can be useful for weather models and forecasts. The ASOS network can also be used to calibrate satellite data, characterize incoming and moving storms, and direct air traffic. The ASOS data is recorded at a one-minute sample interval and is available via FTP, which we will use to extensively analyze and visualize the data in Python.



Python 3’s “ftplib” is a fairly simple interface for accessing data via FTP. To start, the basic example on Python’s ftplib documentation page suffices as an introduction to the library’s methods. We will change the FTP server from debian.org to ‘ftp.ncdc.noaa.gov’ - where the ASOS data is housed. This is shown in the code snippet below:

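A minimal version of that snippet, adapted from the ftplib documentation example with the server swapped to NOAA’s, might look like this:

```python
from ftplib import FTP

ftp = FTP('ftp.ncdc.noaa.gov')  # connect to NOAA's FTP server
ftp.login()                     # log in anonymously (no credentials needed)
ftp.dir()                       # print the root directory listing
```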

Running the snippet should print out the root directory listing of the ftp.ncdc.noaa.gov server, where we can see which directories are available for navigation and reading. The directory specifically important to us is ‘pub’, which stands for public. This directory will allow us to navigate through it and see what data is publicly available for FTP download.

If we now change our directory by navigating to ‘pub/data’, we can see all of the available public datasets provided by the National Centers for Environmental Information (NCEI), formerly called the National Climatic Data Center (hence the FTP server name, ncdc.noaa.gov). We can navigate there and print out the public datasets using the following two simple lines of code:

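Continuing from the connection opened above, a sketch of those two lines:

```python
ftp.cwd('pub/data')  # change into the public data directory
ftp.dir()            # print the available datasets
```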

The printed list will be fairly long, and most of the entries should be available to navigate and download data from. The folder that we are interested in is called ‘asos-onemin’ - the one-minute resolution ASOS data, which provides temperature, humidity, barometric pressure, and much more! We will update our FTP navigation to that folder and then print out the files present in it:

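Again continuing the same session, a sketch:

```python
ftp.cwd('asos-onemin')  # enter the one-minute ASOS data folder
ftp.dir()               # print the folder's contents
```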

And finally, the names of the files/directories in a given FTP directory can be saved to a list using the ‘nlst()’ method:

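A one-line sketch, still in the same session:

```python
dirs = ftp.nlst()  # save the names in the current directory as a list
```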

The variable ‘dirs’ now contains all of the directories with our ASOS data. Each folder name is a different one-minute ASOS data product for a given year. We know this because we can look at the readme.txt file in the directory! We can download the non-directory files, which will make the ASOS one-minute files easier to understand, using the following script:

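One way to sketch this, assuming that issuing ‘RETR’ on a directory raises a permission error (which lets us separate the plain files from the folders):

```python
import os
from ftplib import error_perm

for name in dirs:
    try:
        # RETR succeeds only for plain files, so directories are skipped
        with open(name, 'wb') as f:
            ftp.retrbinary('RETR ' + name, f.write)
    except error_perm:
        os.remove(name)  # remove the empty file created for a directory
```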

For the ASOS system, we get four files: one is the readme.txt file, and the others are description files. I recommend looking through all of the downloaded files for a better understanding of how the ASOS network functions and how its data is distributed.



Before we dive into the data files themselves, it will be helpful to download the station description file, located at the following FTP address:

ftp.ncdc.noaa.gov/pub/data/ASOS_Station_Photos/asos-stations.txt

We first need to navigate to the ASOS_Station_Photos folder via FTP, then we can download the file and sift through it to understand each row and how it pertains to the ASOS network and the respective stations.

As an example of how to sift through the station array, I have included a larger snippet of code that goes through the processes mentioned above, sifts through the asos-stations.txt file, and pulls out the station information. I use the general coordinates of New York City to test whether the program is working. The script below should print a few checks for whether the asos-onemin files and the asos-stations.txt file were downloaded, followed by the nearest station to the input coordinates:

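A sketch of that routine is below. The column layout of asos-stations.txt is inferred from its header row and the dashed ruler line beneath it, and the field names (‘CALL’, ‘NAME’, ‘LAT’, ‘LON’), the ‘K’ prefix added to the call sign, and the ‘6405-2019’ product-year folder name are all assumptions to verify against the downloaded files:

```python
import math
from ftplib import FTP

# download the station description file
ftp = FTP('ftp.ncdc.noaa.gov')
ftp.login()
ftp.cwd('pub/data/ASOS_Station_Photos')
with open('asos-stations.txt', 'wb') as f:
    ftp.retrbinary('RETR asos-stations.txt', f.write)
print('asos-stations.txt downloaded')

# peek at one assumed product-year folder and save its file names
ftp.cwd('/pub/data/asos-onemin/6405-2019')
data_files = ftp.nlst()
print('found {} asos-onemin files'.format(len(data_files)))
ftp.quit()

# parse the fixed-width station file; the ruler line of dashes
# beneath the header row marks the column boundaries
with open('asos-stations.txt', 'r') as f:
    lines = f.read().splitlines()
ridx = next(i for i, ln in enumerate(lines)
            if '-' in ln and set(ln) <= {'-', ' '})
ruler, header = lines[ridx], lines[ridx - 1]

# turn each run of dashes into a (start, stop) slice for that column
slices, start = [], None
for i, ch in enumerate(ruler + ' '):
    if ch == '-' and start is None:
        start = i
    elif ch != '-' and start is not None:
        slices.append((start, i))
        start = None
names = [header[a:b].strip() for a, b in slices]

# build one dictionary per station row
stations = [{n: ln[a:b].strip() for n, (a, b) in zip(names, slices)}
            for ln in lines[ridx + 1:] if ln.strip()]

def distance(lat1, lon1, lat2, lon2):
    # great-circle (haversine) distance in kilometers
    p = math.pi / 180.0
    a = (math.sin((lat2 - lat1) * p / 2.0) ** 2
         + math.cos(lat1 * p) * math.cos(lat2 * p)
         * math.sin((lon2 - lon1) * p / 2.0) ** 2)
    return 2.0 * 6371.0 * math.asin(math.sqrt(a))

lat0, lon0 = 40.7128, -74.0060  # test coordinates: New York City
best = min(stations, key=lambda s: distance(lat0, lon0,
                                            float(s['LAT']), float(s['LON'])))
print('-----------')
print('The nearest station to {},{} is {} (ID: K{}) at {},{}'.format(
    lat0, lon0, best['NAME'], best['CALL'], best['LAT'], best['LON']))
```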

If everything worked as expected, the last line should read something like:

-----------

The nearest station to 40.7128,-74.006 is NEW YORK CNTRL PK TWR (ID: KNYC) at 40.77889,-73.96917

which states that the nearest station to the center of NYC is the KNYC station, which is the expected answer! If we were to input another pair of coordinates, say, for Los Angeles, we would get the printout:

-----------

The nearest station to 34.0522,-118.2437 is LOS ANGELES DWTN USC CAMPUS (ID: KCQT) at 34.0236,-118.2911

which is also as expected, since the University of Southern California campus station is likely the nearest to the central coordinates of Los Angeles.

We can also plot the station latitudes and longitudes from the entire station database to get an idea of just how many stations there are in the continental U.S. alone:
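A quick sketch of that plot using matplotlib and the ‘stations’ list built above:

```python
import matplotlib.pyplot as plt

lons = [float(s['LON']) for s in stations]
lats = [float(s['LAT']) for s in stations]

plt.figure(figsize=(10, 6))
plt.scatter(lons, lats, s=8)  # one point per station
plt.xlabel('Longitude [degrees]')
plt.ylabel('Latitude [degrees]')
plt.title('ASOS Station Locations')
plt.show()
```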

Now that we’ve identified stations based on coordinates, we can identify the station IDs and use those to extract data from the asos-onemin data folder!



Now that we have a way of identifying the stations and their properties, we can look at how the data files are saved in the asos-onemin folder. By looking at the very first file in the ‘data_files’ variable produced by the script above, we can see the naming format of the station files:

64050K1J0201901.dat

This format tells us a few things:

  1. 6405 is the data product type

  2. 0 is a placeholder

  3. K1J0 is the station identifier

  4. 2019 is the year

  5. 01 is the month

  6. .dat is the data file type

We can use this file-naming convention to search for our station of interest. The station identifier can be found using our code above, which pulls it from the asos-stations.txt file. Using geographic coordinates of interest, we can find our nearest station, use its identifier to find the station data, and then download the files relevant to our geographic coordinates. This is done in the script below.

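A sketch of that download step, continuing from the nearest-station code above (the ‘6405-2019’ product-year folder and the ‘K’ + call-sign identifier are, again, assumptions to verify against the server):

```python
import os
from ftplib import FTP

stn_id = 'K' + best['CALL']  # e.g. 'KNYC' from the nearest-station search
folder = '6405-2019'         # assumed product-year folder to pull from

os.makedirs('data', exist_ok=True)  # local destination for the .dat files

ftp = FTP('ftp.ncdc.noaa.gov')
ftp.login()
ftp.cwd('pub/data/asos-onemin/' + folder)

# download every monthly file that matches our station identifier
for fname in ftp.nlst():
    if stn_id in fname:
        with open(os.path.join('data', fname), 'wb') as f:
            ftp.retrbinary('RETR ' + fname, f.write)
        print('downloaded ' + fname)
ftp.quit()
```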

After running the script above, the nearest station’s data will be downloaded into a folder called ‘data/’ in the local directory, from which we will subsequently read and parse the real data.



Now that we have the data files stored locally, we can begin to parse and visualize the data. In this case, we can open the .dat files using Python’s csv reader, read them in as fixed-width files, and separate the data based on the fixed widths of the data rows.

This is done in the code below. The additional lines continue from the code above, so look for the added lines as guidance for how the parsing takes place.

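A condensed sketch of the parsing and plotting step. The character offsets for the timestamp, temperature, and dew point fields are placeholders to check against the description files downloaded earlier, and the relative humidity here uses the Magnus-formula approximation from temperature and dew point:

```python
import os
import math
import datetime
import matplotlib.pyplot as plt

def rel_humidity(T, Td):
    # Magnus-formula approximation for relative humidity [%],
    # given air temperature T and dew point Td (both in deg C)
    b, c = 17.625, 243.04
    return 100.0 * math.exp(b * Td / (c + Td) - b * T / (c + T))

times, rh = [], []
for fname in sorted(os.listdir('data')):
    if not fname.endswith('.dat'):
        continue
    with open(os.path.join('data', fname), 'r') as f:
        for line in f:
            # assumed fixed-width offsets: timestamp (YYYYMMDDHHMM) after the
            # station header, temperature and dew point in later columns;
            # verify these against the data description files
            try:
                t = datetime.datetime.strptime(line[13:25], '%Y%m%d%H%M')
                T, Td = float(line[70:76]), float(line[76:82])
            except ValueError:
                continue  # skip rows with missing or malformed fields
            times.append(t)
            rh.append(rel_humidity(T, Td))

plt.plot(times, rh, linewidth=0.5)
plt.xlabel('Time')
plt.ylabel('Relative Humidity [%]')
plt.show()
```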

The code above is a lot to take in. It handles all of the aforementioned processes, as well as some simple data parsing and handling. Perhaps the most interesting part is the calculation of relative humidity, which involves a few lines of temperature and moisture calculations that arrive at an approximate value. The resulting plot should be nearly identical to the one below, depending on the data type and station selection.

Notice that we have one-minute resolution data (with dropped points, of course) for an entire month. This produces over 40,000 data points (60 minutes × 24 hours × ~30 days), which is a fairly large amount of data to work with from a single weather station.



This tutorial focused on Python’s file transfer protocol (FTP) library for parsing weather station data made openly available by the National Climatic Data Center (NCDC). The flexibility of the code presented above allows users to parse the information entirely by scripting, which enables automated visualization and analysis of weather data across the country (and the world!). FTP is a powerful tool that facilitates this type of analysis and allows programmers to look at large amounts of data without needing to manually scroll and parse through it all. This tutorial was meant as an introduction to the capabilities of Python’s FTP library, while also showing a real-world example of how to use FTP methods and the downloaded data in an automated fashion.
