---
title: "Parsing POS Layout Files for Descriptive Names"
author: "Robert J. Gambrel"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parsing POS Layout Files for Descriptive Names}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette will show how a Provider of Services report layout file can be
quickly parsed to extract the descriptive variable names it contains. POS
datasets from year 2010 and earlier have generic variable names like `PROV0001,
PROV0002, ...` that offer no insight into what the variable actually is. In the
Layout file, along with a data dictionary explaining the variable's values,
there is also a COBOL descriptive name. The `pos_names_extract()` function will
parse this file and return the descriptive names, in the order that matches the
variables in the dataset.

## Provider of Services Data

I have included a sample of the 2010 Provider of Services data for hospices. The
full 2010 file (along with many other years) is available [from the
NBER](https://www.nber.org/data/provider-of-services.html) and contains data
from other provider types as well.

```{r, eval = T}
library(medicare)
# load the package data
data(pos2010, package = "medicare")
names(pos2010)[1:10]
```

These variable names are useless, and with over 500 variables it is impractical 
to look up each one. Instead, we can parse the layout file to obtain useful
names. In this example, I have bundled the Layout 2010 file with this package, 
but I expect the user to have the downloaded text file that corresponds to each
dataset in use.

```{r}
# filepath should be changed by user
filepath <- system.file("extdata", "layout10.txt", package = "medicare")
names_2010 <- pos_names_extract(filepath, pos2010)
names_2010[1:10]
```

These are much more descriptive variable names and worth using.

```{r}
pos2010_renamed <- pos2010
names(pos2010_renamed) <- names_2010
```

Note that it is up to the user to make sure that the layout file is appropriate 
for the chosen data file. Each year's layout file is different, so each year 
must be parsed separately. The function checks whether the number of variables 
in the layout file and dataset match and whether the generic variable names are 
the same in both. It will stop if there's a problem. **If the generic names from
dataset 20XX are the same as in layout 20YY, the parsing should work, but won't
necessarily be accurate. CMS is not 100% consistent with variable naming across
years.**

```{r, error = T}
pos2010_short <- pos2010[, 1:500]
names_2010_short <- pos_names_extract(filepath, pos2010_short)
```

```{r, error = T}
pos2010_wrong_names <- pos2010
names(pos2010_wrong_names)[1:3] <- c("wrong1", "wrong2", "wrong3")
names_2010_wrong_names <- pos_names_extract(filepath, pos2010_wrong_names)
```


## Pre-compiled dataset names 

In order to same the user time and headaches of
downloading each year's Layout file, I have pre-compiled dataset names for years
2000-2010. These can be accessed via the `pos_names()` function. By looking at
inner variables, this also illustrates how the dataset layouts change over time:

```{r}
for (year in 2000:2010) {
  print(year)
  print(pos_names(year)[200:205])
}
```