Historical Census Rescue Project for Social Science Research

Statement of Work for FY 2001-2004

 

UC Data Archive & Technical Assistance

and

California Digital Library

BACKGROUND:

Between 1973 and 1996 Lawrence Berkeley Laboratory (LBL) amassed an enormous collection of numeric social science statistical data used for government planning purposes by the U.S. Departments of Labor and Energy and the Army Corps of Engineers. This rich data collection is in imminent danger of extinction. The collection includes numerous invaluable historical electronic files (such as 1960 Census summary information and 1970 Census tract digitized boundary files) found nowhere else in the country. Support for the database at LBL ended in 1998, and the last remaining computer upon which this data resides is being kept alive at the Bureau of the Census; this computer (a 1980s-vintage Digital Equipment VAX) could cease operations at any time. The UCB Library and UC DATA have been working to rescue and preserve this data and make it available to researchers at UCB and elsewhere through the Government and Social Science Information (GSSI) web site. Data of this type has been widely used on the UCB campus for research in Demography, Political Science, City and Regional Planning, Agricultural and Resource Economics, and Epidemiology.

 

SCOPE OF PROJECT:

The UCB Library and UC DATA propose a two-phase effort to rescue 1970 and 1980 Census data, as well as 1960 county-level data and historic census tract boundary files.

 

Phase I: Convert data from original compressed format, archive and make available for electronic storage and transfer.

These data were stored in a unique compression format by Lawrence Berkeley Laboratory during the 1970s and 1980s. The Historical Census Rescue Project will decompress these data into ASCII format for ftp access at the GovDocs section of the University of California Social Science and Government Data Library web site.

 

Phase II: Install data for web-accessible retrieval, display, and selective downloading

Data extracted and archived will be installed into a WWW interface for selection by area and data item, then retrieved and displayed as HTML tables, with an option for downloading in most modern data formats (e.g., Excel spreadsheets). Users will then be able, for example, to select all census tracts with Latino population greater than 15% and display or download family composition and median family income statistics. This phase will require some computer programming effort on the part of UC DATA and the UCB Library. We expect to utilize software already developed by the California Digital Library Counting California project, which covers 1990 and 2000 Census data for California counties and cities (http://countingcalifornia.cdlib.org).
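The tract selection described above can be sketched in a few lines of Python. This is only an illustration of the intended query, not the Counting California software: the record layout, the field names (`tract`, `total_pop`, `latino_pop`, `median_family_income`), and all values are hypothetical, and CSV is used here as a simple stand-in for a spreadsheet download format.

```python
import csv
import io

# Hypothetical tract records; field names and values are illustrative only.
TRACTS = [
    {"tract": "4001.00", "total_pop": 4000, "latino_pop": 800, "median_family_income": 21000},
    {"tract": "4002.00", "total_pop": 5000, "latino_pop": 500, "median_family_income": 25500},
    {"tract": "4003.00", "total_pop": 0, "latino_pop": 0, "median_family_income": None},
    {"tract": "4004.00", "total_pop": 2000, "latino_pop": 301, "median_family_income": 18250},
]


def select_tracts(tracts, min_latino_share=0.15):
    """Return tracts whose Latino share of total population exceeds the threshold."""
    selected = []
    for t in tracts:
        # Skip unpopulated tracts to avoid dividing by zero.
        if t["total_pop"] > 0 and t["latino_pop"] / t["total_pop"] > min_latino_share:
            selected.append(t)
    return selected


def to_csv(tracts, columns):
    """Write only the requested columns as CSV text suitable for download."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(tracts)
    return buf.getvalue()


if __name__ == "__main__":
    chosen = select_tracts(TRACTS)
    print(to_csv(chosen, ["tract", "median_family_income"]))
```

In a production interface the same filter would more likely be expressed as a database query; the point here is only the two steps of selecting areas by a threshold and then restricting the output to the requested data items.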

 

PHASE I: CENSUS FILES TO BE DECOMPRESSED AND MADE AVAILABLE VIA FTP

 

Priority 1) 1960 Census of Population

This file has 1000 subject matter cells for each county. Geographic coverage is: counties as designated by 1960 census county codes and 1960 FIPS county codes.

 

Priority 2) 1980 Census of Population

  • Summary Tape File 3: This tabulation contains over 1000 items from the sample tabulations. Geographic coverage is: States, Counties, Places, Minor Civil Divisions, Census Tracts (in MSAs) and Enumeration Districts and Block Groups (for the entire nation).
  • Summary Tape File 1: This tabulation contains 300 items on population and housing from the 100 percent forms. Geographic coverage is: States, Counties, Places, Minor Civil Divisions, Census Tracts (in MSAs) and Enumeration Districts and Block Groups (for the entire nation).

 

Priority 3) 1970 Census of Population

  • Second Count: This tabulation contains the most complete age and race breakdown of the 1970 Census. Population counts are available for 5 race groups in single-year age cohorts (0-1 to 100+). Geographic coverage is: States, Counties, Places, Minor Civil Divisions and Census Tracts.
  • Fourth Count: This tabulation is the most complete sample data available at the subcounty level. The data has 1178 subject matter cells repeated for 5 racial groupings. Geographic coverage is: States, Counties, Places, Minor Civil Divisions and Census Tracts.

 

Priority 4) 1970 Census Tract Map Boundary Files

During the 1970s Lawrence Berkeley Laboratory digitized the map boundary files for 1970 Census tracts in Metropolitan Statistical Areas. This boundary file was utilized in production of the 1970 Urban Atlas series, a joint project between the Bureau of the Census and the Department of Labor. About 35,000 polygon vector lat-long records are available.

 

Priority 5) Other Datasets:

Other datasets which should be actively considered for rescue are:

  • County Data book time series 1947-1977 for most counties in the US;
  • Mortality and Morbidity summary data by county.

 

DOCUMENTATION:

Each dataset will need to have a human-readable version of the following associated documentation:

 

  • Data Dictionary: All datasets in the SEEDIS system had a data dictionary referred to as a Data Definition File (DDF). The DDF specified the data element name, type, length, and number of decimal places, as well as prototype output headers to be used when displaying the data element values.
  • Help File: The Help file was an eye-readable version of the data dictionary which organized the data elements by subject with a table of contents and (occasionally) an index. The format of these help files is similar to UNIX troff markup and will be reformatted into modern dictionary and word processing formats.
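To show how a data dictionary of this kind drives interpretation of the decompressed ASCII files, the following Python sketch models a dictionary entry with the attributes listed above (name, type, length, decimal places, output header) and uses it to decode one record. The actual DDF syntax is not reproduced here. The type codes, the fixed-width layout, the implied-decimal convention, and the element names `COUNTY`, `POP` and `PCTURB` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """One data dictionary entry (hypothetical representation, not DDF syntax)."""
    name: str
    kind: str          # assumed type codes: "char", "int", or "decimal"
    length: int        # field width in characters
    decimals: int = 0  # implied decimal places for "decimal" fields
    header: str = ""   # prototype column header for display


def decode_record(line, elements):
    """Split a fixed-width ASCII record into typed values using the dictionary."""
    values = {}
    pos = 0
    for el in elements:
        raw = line[pos:pos + el.length]
        pos += el.length
        if el.kind == "char":
            values[el.name] = raw.strip()
        elif el.kind == "int":
            values[el.name] = int(raw)
        elif el.kind == "decimal":
            # Assumes the decimal point is implied rather than stored.
            values[el.name] = int(raw) / (10 ** el.decimals)
        else:
            raise ValueError(f"unknown element type: {el.kind}")
    return values


# Illustrative dictionary and record; none of these values come from real files.
ELEMENTS = [
    DataElement("COUNTY", "char", 5, header="County code"),
    DataElement("POP", "int", 8, header="Total population"),
    DataElement("PCTURB", "decimal", 4, 1, header="Percent urban"),
]

if __name__ == "__main__":
    record = "06001" + "00012345" + "0987"
    print(decode_record(record, ELEMENTS))
```

Keeping the header text alongside the type information means the same dictionary entry can be used both to read the data and to label it in displayed tables.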

 

PHASE II: ACCESS SOFTWARE AND METADATA:

UC DATA proposes to develop software which will facilitate interactive geographic area selection and display of data as tables. All metadata describing the data will be created or converted to XML DTD specifications of the summary tables, following an emerging documentation standard for social science data under development at the University of Michigan's Inter-university Consortium for Political and Social Research. With the application of XML, XSL style sheets will be developed for multidimensional table display. We expect to utilize the California Digital Library (CDL) Counting California project's collection of database and SAS programs for the task of data request and display, and thus avoid original software development as much as possible. However, the project will require a half-time programmer analyst to adapt the CDL software to retrieve data for sub-county and sub-city geography and to design and load new databases accordingly.
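As a rough sketch of the metadata conversion step, the Python fragment below builds an XML description of one summary table from dictionary-style entries using the standard library. The element and attribute names (`summaryTable`, `variable`, `label`, `format`) are hypothetical placeholders and do not follow the ICPSR standard mentioned above; the table identifier and variable names are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET


def variable_to_xml(name, label, kind, length, decimals=0):
    """Build a <variable> element (hypothetical schema) for one data item."""
    var = ET.Element("variable", {"name": name})
    ET.SubElement(var, "label").text = label
    ET.SubElement(
        var,
        "format",
        {"type": kind, "length": str(length), "decimals": str(decimals)},
    )
    return var


def table_to_xml(table_id, title, variables):
    """Serialize a summary table description as an XML string."""
    table = ET.Element("summaryTable", {"id": table_id})
    ET.SubElement(table, "title").text = title
    for v in variables:
        table.append(variable_to_xml(**v))
    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    # Illustrative entries only; not taken from any real data dictionary.
    xml_text = table_to_xml(
        "T1",
        "Population by urban status",
        [
            {"name": "POP", "label": "Total population", "kind": "int", "length": 8},
            {"name": "PCTURB", "label": "Percent urban", "kind": "decimal",
             "length": 4, "decimals": 1},
        ],
    )
    print(xml_text)
```

An XSL style sheet could then transform documents of this shape into HTML tables; the real element names would be dictated by whichever DTD the project adopts.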

 

PROJECT MANAGEMENT: Project management will be under the direction of Dr. Fred Gey, assistant director of UC DATA, who worked at LBL for many years as database manager for the LBL data collection until coming to the UCB campus in 1989. Dr. Gey is familiar with the unique LBL data formats. He directed a data rescue effort which recently made available 1970 Census data for nearly 300,000 areas (about 90 million pieces of data). He will be assisted by principal data archivist Ilona Einowski, director of User Services at UC DATA, who has worked on the Counting California project as an XML content expert.