class: center, middle, inverse, title-slide # Making Public Data Public ###
.black[Christoph Sax]
### Zurich R Meetup - November 7, 2017 --- class: center, middle, inverse # How to get Inflation Data? ??? Hi, my name is Christoph, and I am a R consultant. I am consulting private companies and largely federal public administration offices, mostly on things related to R. Until 4 years or so, I used to work for the state secretariat of economic affairs, the SECO where we integrated quarterly GDP estimation in R. We moved a process that was mainly based on excel, visual basic and some specialized software to a unified process in R. 3 CRAN packages originated from that time. Since then I was working with other federal offices on other R projects. For example, at the federal finance department, we produce consolidated governmental account in R, at the Federal Social Insurance Office, we model do rent modeling in R. So R is gaining a lot of ground. In some cases, R helps to make tedious Excel VBA things more integrated and automated, in other cases R can take over things that have been done with SQL database solutions, and allows the responsible business people to gain control back from IT. So R is greatly helping public administrations to become more organized and more efficient. But despite of that, I have seen much less change in the outptu they are producing. The ouptput still mostly are excel files, with weired formats, weired colors... Which brings me to the topic of this talk. PUBLIC DATA IS HARD TO FIND, HARD TO DOWNLOAD AND HARD TO PROCESS. I will show you by example how hard it is and will discuss some solutions.... Can it also be used to improve the products which they are producing? Which brings me to the main topic of this talk: How to make public data public? --- class: center, middle <iframe style="border-width: 0px" width="100%" height="460px" src = "./img/lik_iframe.html"></iframe> ??? Here is an Example. The consumer price index is measuring the price level and is one of the most important statistics. It's highly relevant for monetary policy, but it also gives guidance to the development of wages, pensions and rents. So how do we get this simple number? Or how do we produce a graph like this? --- # Where to Search? Federal Statistical Office FSO Cantonal Statistical Offices State Secretariat for Economic Affairs SECO Federal Customs Administrations Swiss National Bank SNB and many more... ??? If you search for a Statistic in Switzerland, it starts with the problem that you don't know which agency is producing it? Beside the Federal Statistical Office, there are the cantonal offices and many other federal offices. Plus there is the National Bank which produces some significant datasets. And then there are alos quite a few private data producers, such as the KOF, Credit Suisse, Wust Parter, etc... --- # Where to Search? .pull-left[ [.black[Unemployment Rate]](https://www.bfs.admin.ch/bfs/en/home/statistics/work-income/surveys/als.html) [.black[Exports of Services]](https://snb.ch/en/iabout/stat/statrep/id/statpub_bopiip_all) [.black[Exports of Goods]](https://www.ezv.admin.ch/ezv/en/home.html) [.black[Consumer Price Index]](https://www.bfs.admin.ch/bfs/en/home/statistics/prices.assetdetail.3522252.html) ] -- .pull-right[ FSO SECO Federal Customs Administrations Swiss National Bank SNB ] ??? So let's play the game... --- # Where to Search? .pull-left[ [.green[Unemployment Rate]](https://www.bfs.admin.ch/bfs/en/home/statistics/work-income/surveys/als.html) [.black[Exports of Services]](https://snb.ch/en/iabout/stat/statrep/id/statpub_bopiip_all) [.black[Exports of Goods]](https://www.ezv.admin.ch/ezv/en/home.html) [.black[Consumer Price Index]](https://www.bfs.admin.ch/bfs/en/home/statistics/prices.assetdetail.3522252.html) ] .pull-right[ FSO .green[SECO] Federal Customs Administrations Swiss National Bank SNB ] --- # Where to Search? .pull-left[ [.green[Unemployment Rate]](https://www.bfs.admin.ch/bfs/en/home/statistics/work-income/surveys/als.html) [.orange[Exports of Services]](https://snb.ch/en/iabout/stat/statrep/id/statpub_bopiip_all) [.black[Exports of Goods]](https://www.ezv.admin.ch/ezv/en/home.html) [.black[Consumer Price Index]](https://www.bfs.admin.ch/bfs/en/home/statistics/prices.assetdetail.3522252.html) ] .pull-right[ FSO .green[SECO] Federal Customs Administrations .orange[Swiss National Bank SNB] ] --- # Where to Search? .pull-left[ [.green[Unemployment Rate]](https://www.bfs.admin.ch/bfs/en/home/statistics/work-income/surveys/als.html) [.orange[Exports of Services]](https://snb.ch/en/iabout/stat/statrep/id/statpub_bopiip_all) [.blue[Exports of Goods]](https://www.ezv.admin.ch/ezv/en/home.html) [.black[Consumer Price Index]](https://www.bfs.admin.ch/bfs/en/home/statistics/prices.assetdetail.3522252.html) ] .pull-right[ FSO .green[SECO] .blue[Federal Customs Administrations] .orange[Swiss National Bank SNB] ] ??? Let's play the game. Which of the following series is published by what agency? The unemployment rate? --- # How to Get the Data? We have a [link](https://www.bfs.admin.ch/bfs/en/home/statistics/prices.assetdetail.3522252.html): ``` https://www.bfs.admin.ch/bfs/en/home/statistics/ prices.assetdetail.3522252.html ``` -- .green[But it is not stable...] ??? Now we figured out where to find the data. We went to the FSO website, searched for while and finally found a link. Now we would like to use this link to update our data each month. Can we? Unfortunately no, it turns out the link is outdated the next time we go there. These links are not stable. --- class: middle, center # How to Process the Data? ??? Let's say we manage to overcome this difficulties as well. Now we are left with the Excel. --- background-image: url("img/bfs_lik_file.png") background-size: 100% ??? Which means the real fun begins. Just have a look at the file. The data is in a partially wide, partially long format. Months are as columns, years as rows. And there are summarizing columns, empty rows, and other stuff around. --- # How to Process the Data? ```r library(readxl) library(dplyr) library(tidyr) link <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/2503582/master" tfile <- paste0(tempfile(), ".xls") download.file(link, tfile) ``` ??? So this is really great stuff for a dplyr data cleaning exercise. --- # How to Process the Data? ```r dta <- # read data, skip the first 10 rows read_excel(tfile, sheet = "2015", skip = 10) %>% # remove empty rows filter(!is.na(X__1)) %>% # do no use the three final columns select(-X__2, -X__3, -`5266`) %>% # reasonable column names setNames(c("year", 1:12)) %>% ``` --- # How to Process the Data? ```r # get rid of pseudo wide data structure gather(month, cpi, -year) %>% # construct time stamp mutate(date = as.Date(paste(year, month, "1", sep = "-"))) %>% # and clean up select(date, cpi) %>% filter(!is.na(cpi)) arrange(date) ``` --- # How to Process the Data? ```r dta ## # A tibble: 413 x 2 ## date cpi ## <date> <dbl> ## 1 1982-12-01 63.8 ## 2 1983-01-01 63.7 ## 3 1983-02-01 63.8 ## 4 1983-03-01 63.9 ## 5 1983-04-01 64.1 ## 6 1983-05-01 64.2 ## 7 1983-06-01 64.4 ## 8 1983-07-01 64.3 ## 9 1983-08-01 64.5 ## 10 1983-09-01 64.6 ## # ... with 403 more rows ``` ??? So this is a nice exercise, but it is of course terrible! --- # The Three Plagues of Inflation Research 1. Finding the Data 1. Downloading the Data 1. Processing the Data ??? Which completes the three plagues of inflation research: Finding, Downloading and Processing. Each step terrible on its own. So how can we solve it? --- class: inverse, center, middle # Solutions? ??? There are a number of ideas around, an I will quickly review them. --- background-image: url("img/opendata.png") background-size: 100% ??? Opendata.swiss is an attempt to just do that: making public data public. It is far from complete, however, and it seems to be largely ignored by the Federal Statistical Office. So, this won't help us here. --- background-image: url("img/fred.png") background-size: 100% ??? The US are very similar situation, with an equally decentralized system of data collection. But here the one of the federal reserve banks came up with a reasonable solution. FRED collects data from all the providers and makes it available quite easy. I still would think the interface could be simplified and improved, but this seems still to be the closest think of what we need. --- background-image: url("img/quandl.png") background-size: 100% ??? Quandl, a commercial data provider that offers some open data as well, has nice interfaces to most software packages. It has some Swiss data, but of course far from complete. --- background-image: url("img/gapminder.png") background-size: 100% ??? Not too helpful in our inflation quest, but a very interesting site on its own. Created by Hans Roesling, a Swedish Professor how recently died, this offers the greatest collection of global data. The scope are long term development and health data. What I like a lot about the site that it really attempts to do what the internet ist best at: visualizing the data. This is actually a page where you like spending time at, just enjoying at the data, without being looking for a specific series. --- background-image: url("img/snb.png") background-size: 100% ??? And finally, the new data portal by the national bank. This is the closest think to being useful. They offer a lot of data tha is not from the bank, but from some other federal agencies. It has a stable API and is currently the best source for machine reading of data. The downside: Data is only available after a few weeks. If you are really interested in commenting on latest CPI developments, this is too late, It also fails the 'oh i am enjoying this' test. --- background-image: url("img/dataseries.png") background-size: 100% ??? Which leaves us with a hole to be filled, and dataseries is an attempt to fill it. This was mostly a pet project of mine, but I hope I can make it a bit more seriously. I started to do some interesting collaboration recently, but I will tell you about that later on. So, this is an attempt to bring all relvant data into a single place, and lets just see how this works for Consumer Prices. So let's go to the website, you can try it as well if you want, hope the server can handle it. There is only a single thing you can do here, and it brings you right to the series. This is all available data on the CPI, so contrary to the FSO file, this included older data with a different base. We can zoom in, zoom out, transform etc. All the the nice stuff. When you are looking for consumer prices, dataseries shows you the series that is most relevant most of the time. So it gives you the overall CPI here. But dataseries also tries to be complete, so so it does not stop here, but gives you all the subcomponents of the CPI, You will find these data at the FSO as well, but it is distributed among many files, each with its own logic. So just lookting at the data is nice, but there is some other stuff you can do. You can download the graph in some usual formats, but you can also copy this link and share it with others, This will reproduce the current view with the latest data. If you have a website, you can include an interactive iframe to it. Let's say you are a newspaper, importing this graph is as easy as it can be, Finally, we are all R users, so the ultimate interest is how to bring the data into our beloved console. Click here, install the data series package from CRAN if you dont have it, and just use this code to get you the latest data. One note about completeness. The aim is to have all 'relevant' data here, so this is of course a bit a subjective choice. I think we are pretty close in the field of Economics. --- # Goals 1. A structured collection of the relevant data series for Switzerland. 1. An intuitive way to browse the data visually. 1. Automatically updated from various sources. 1. Data available for free, without login, easily importable into major software programs. 1. Include historical data, whenever possible. --- # Technicalities GUI - Powered by Shiny - UI developed in HTML - User defined JavaScript elements - Time Series by the dygraphs R package / JS library Database - In-memory `data.table`, containing all time series in a long format. Scrapping - `rvest`, `data.table` and `dplyr`. --- # Import to R ```r library(dataseries) plot(ds(c("CCI.AIK", "ts"))) ``` --- class: inverse, center, middle # Outlook --- # Modularizing 1. Data Scrapping 1. Visualization 1. Research on Historical Series 1. API Usage --- # Data Scrapping Collaboration with ETHZ KOF Scrapping FSO (and other) federal agencies Make data available in a format that can be easily read by machines KOF and dataseries.org will then read from this scrapping server ??? Data Scrapping is currently the most active subproject here. We started an interesting collaboration with the KOF economic institute, with Matthias Bannert in particular, to rebuilt the scrapping infrastructure. The coal is to scrape the FSO and other data providers daily or serval time a day, so we can make the data available in a format that is easy to read for computers. The scrapping process should also be accompanied by a rich set of test, so can get some really reliable data of it. This is certainly has the highest priority. Right now, we are still relying on the national banks API for quite a few series, which means they have a delay to actual publication. --- # Visualization Better Search Visualizing cross sectional data Visualizing time as time, i.e., play a video ??? At some point I would also like to see improvements in the visualization area. The search is still quite simplistic, and visualization cuold also allow for cross section comparison. Not sure if this can be done without making the UI more complicated, which is something I would like to avoid. Finally, a think that alway fascinated me about Gapminder what how the managed to visualize time as actual time. Not too clear where this could go, but definitely something that interests me. --- class: center, middle # .black[Thanks!] <br> <img width = "30%", src="img/logo.png" alt=""> <!-- [.black[christoph@christophsax.com]](mailto:christoph@christophsax.com) --> ??? So thanks for listening. Go to dataserie.org, explore the data, and let me know whats missing.