Processing the Census American Community Survey (ACS) used to take us months, but over time we’ve cut that to weeks, and now, days. With the new 2012-2016 data updated on PolicyMap last week, we thought it might be interesting to take a behind-the-scenes look at how ACS gets processed. This post gets wonky, but if you’ve ever been curious about what we do, you might enjoy this inside look.
The first step every year is deciding what data we want to add from ACS. ACS has over 30,000 variables, all of which are useful to someone, but only a portion of which are likely to be useful to PolicyMap users.
Throughout the year, we keep an eye out for data people are asking for, or data that’s in the news. We also look to see what new data is available from the ACS. Every year, we meet to discuss what we might add. Instrumental to this year’s discussion was Tom Love, PolicyMap’s Director of Business Development. Thanks to his time working with a large number of our users, he had a list of indicators that were in demand. From this list, we were able to add new indicators on poverty, disability, veterans, and military healthcare. Lauren Payne-Riley, who joined our team as a data analyst this year, really dove into ACS data and found a number of indicators we never thought to add, most prominently jobs by occupation.
There are also plenty of indicators we decide not to add, for a number of reasons. An indicator might be of use to too few people, it might be too confusing, it might be misleading to non-data experts, or it may be very interesting but not vary much across areas. A while back, we considered adding ancestry data, but decided against it due to problems with the data.
This year, there were no significant additions to the ACS itself, but we’re excited for next year, when data on internet use is expected to become available at local geographies.
Turning Their Data into Our Data
Once we decide on the new indicators to add, we have to look through the Census’s Table Shells file to find out the location of each variable. Often, the ACS variables are so detailed that we end up needing to combine them together. For example, all the job occupation variables are separated into male and female indicators. There are probably fascinating insights to be gleaned by comparing male and female presence in various occupations, but for PolicyMap users, it’s likely overkill.
So we put together an internal table of all these new indicators and wrote metadata for each of them, using information from the Census’s Subject Definitions document.
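For the curious, here is a rough sketch of what combining sex-specific occupation variables into a single indicator could look like. It is not our actual processing code (ours lives in SQL). The variable names, the indicator names, and the rule that a missing source value makes the combined value missing are all assumptions made up for illustration.

```python
# Hypothetical mapping from a combined indicator to the detailed
# variables it is built from. Real ACS table shells use different IDs.
COMBINE = {
    "occ_management": ["male_management", "female_management"],
    "occ_service": ["male_service", "female_service"],
}


def combine_variables(row, mapping):
    """Sum the source variables for each indicator.

    If any source value is missing (None), the combined value is None
    rather than a misleading partial sum. That rule is an assumption.
    """
    combined = {}
    for indicator, sources in mapping.items():
        values = [row.get(name) for name in sources]
        if any(value is None for value in values):
            combined[indicator] = None
        else:
            combined[indicator] = sum(values)
    return combined


# Invented counts for a single tract.
tract = {
    "male_management": 120,
    "female_management": 95,
    "male_service": 60,
    "female_service": None,
}
combined = combine_variables(tract, COMBINE)
print(combined)
```

Keeping the mapping in a table like `COMBINE`, separate from the logic, mirrors the idea of an internal indicator table: adding an indicator means adding a row, not writing new code.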
The Census helpfully distributes preliminary materials a week before the data release, including details on any changes to the data from the previous year. One change we didn’t anticipate was an overhaul of the ACS’s data on language spoken at home. This data contains the numbers of households that speak any of over 40 different languages, and whether they’re proficient in English. The languages listed in this table changed.
The first change that I noticed was that the variable for Yiddish and the variable for Other West Germanic Languages (which prominently includes Pennsylvania Dutch) were combined into a single indicator. (We joked that by combining Hasidic Jews and Amish into a single indicator, we’d have data on religious communities with black hats and big beards, who don’t use electricity some or all of the time.) Other languages were combined as well, while some new languages were added.
This meant we’d have to quickly prepare all new indicators for the changed language data.
December 7th, 2017, 12:10 AM
Now all there was left to do was wait for the actual data to be released. I waited up until midnight (kept awake by an exciting hockey game), and as soon as twelve o’clock hit, started hitting refresh on my browser like it was election night. The data didn’t appear, so I brushed my teeth, put on pajamas, and returned to see the data up. I downloaded it immediately.
Of course, it’s a lot of data, and it comes in highly compressed. I waited for all the data to download, started the unzipping process, and went to bed. I’m not the greatest sleeper, so when I woke up in the middle of the night, I took the opportunity to finish the rest of the unzipping so it would be ready by morning.
Data of Significant Import
The morning of December 7th, we had all the raw data in our file system, but it needed to be imported into our internal database. This is an all-day process which mainly consists of me pestering Dominic, our database architect, about why the data isn’t importing faster. (Answer: there is a lot of it.) There were also some surprises in the ACS geography tables, which say what data corresponds with what locations, but thanks to some discussion at the ACS Data Users Group, we were able to find a workaround for getting these tables.
Trust the Process
Then comes the most exciting step: processing. It’s SQL heaven. This is where all those detailed ACS variables become PolicyMap indicators. First we make counts, averages, and medians, then we use the counts to make percents and percent changes. We have steps to make sure all geographies are included, and then we bring in the 2000 and 2010 Decennial Census data, and the 2007-2011 ACS data (moving forward from the 2006-2010 data shown with last year’s 2011-2015 data). And then we send the data off to our web developers.
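To make the counts-to-percents step concrete, here is a minimal sketch in Python rather than our actual SQL. The function names, the rounding to two decimals, and the choice to return nothing when the denominator is zero or missing are assumptions for illustration, and the numbers are invented.

```python
def percent(count, universe):
    """Share of the universe (for example, total workers) that the count represents."""
    if count is None or not universe:
        # A missing count, or a zero or missing universe, gives no percent.
        return None
    return round(100.0 * count / universe, 2)


def percent_change(new, old):
    """Percent change from an earlier value to a later one."""
    if new is None or not old:
        # Change from a zero or missing base is undefined.
        return None
    return round(100.0 * (new - old) / old, 2)


# Invented numbers: 215 people in an occupation out of 860 workers,
# up from 200 in the earlier period.
share = percent(215, 860)
change = percent_change(215, 200)
print(share, change)
```

Counts come first because percents depend on them: both the numerator and the universe have to be settled before dividing.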
Once the data is processed, we check it thoroughly. Everything that shows up on PolicyMap must be correct. We want to make sure that we didn’t make any mistakes in processing the data, that the Census didn’t include any errors in their files, and that there aren’t any further changes to the data that we didn’t catch.
In the old days of PolicyMap, we would individually check every single indicator against the Census’s FactFinder. If our number matched their number, we were good. With thousands of indicators to check, this took time and, once we’d gotten the process down, rarely turned up any errors. We still use this method to check all our new indicators (like the new occupation data), but have turned to new methods for everything else. We do checks to make sure the data is consistent with itself, and reasonably consistent with last year’s release. As part of this validation, we found one large change that kept recurring, which had a sad, logical explanation: The number of World War II veterans is precipitously declining.
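The two kinds of checks described above can be sketched roughly as follows. This is an illustration, not our actual validation code: the 25% threshold, the function names, and all of the numbers are assumptions.

```python
def parts_match_total(parts, total, tolerance=0):
    """Internal consistency: detailed counts should add up to their total."""
    return abs(sum(parts) - total) <= tolerance


def flag_large_changes(current, previous, threshold=0.25):
    """Year-over-year consistency: list indicators whose relative change
    from the previous release exceeds the threshold (an assumed 25%)."""
    flagged = []
    for name, new in current.items():
        old = previous.get(name)
        if new is None or old is None or old == 0:
            continue  # nothing meaningful to compare against
        change = (new - old) / old
        if abs(change) > threshold:
            flagged.append((name, round(change, 3)))
    return flagged


# Invented values for one area in two consecutive releases.
previous = {"wwii_veterans": 1000, "people_in_poverty": 500}
current = {"wwii_veterans": 700, "people_in_poverty": 510}

flagged = flag_large_changes(current, previous)
print(flagged)
print(parts_match_total([120, 95], 215))
```

A flag is a prompt for a human to look, not proof of an error: a big drop can be a processing mistake or, as with World War II veterans, a real change in the world.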
Just as everything was looking to be smooth sailing, we found a major unexpected change: The new language data, which we’d prepared as best we could, was only available at the national, state, and congressional district levels, a major reduction in granularity from the tract level at which it was previously available.
We were faced with a decision: Update to the new data, or discard it and keep last year’s language data. We went with keeping last year’s data; data at the local level has so much more utility than state and national data alone. So we then had to quickly roll back all the changes we’d previously made to accommodate the new language indicators.
It’s Go Time
By Friday night, everything was ready on our staging site. We made the decision to wait until Saturday night to make the new data live, to give ourselves a little more time to check everything out. After some more rigorous testing, everything looked good.
And that’s how ACS data gets on PolicyMap!