Faculty/Researcher Survey on Data Curation

Greg Janée
UCSB Library; Earth Research Institute; California Digital Library

James Frew
Bren School of Environmental Science & Management

doi:10.5062/F4PN93K4

Executive summary

In 2012 the Data Curation @ UCSB Project surveyed UCSB campus faculty and researchers on the subject of data curation, with the goals of 1) better understanding the scope of the digital curation problem and the curation services that are needed, and 2) characterizing the role that the UCSB Library might play in supporting curation of campus research outputs. The project received responses from one-third of the estimated target audience of 900 researchers, a response rate that indicates great interest in the topic and yields statistically meaningful results. To summarize the survey's findings:

These findings echo and complement similar surveys performed by other higher education institutions, including surveys at the University of Colorado, Boulder [4], California Polytechnic State University, San Luis Obispo [7], Georgia Tech [8], and University of Oxford [9], as well as the Digital Curator Vocational Education Europe (DigCurV) Project [2].

Taken together, these findings argue for the establishment of a campus unit possessing data curation expertise and providing curation-related assistance to campus researchers, and possibly hosting curation services (as necessary and as funding allows).

Survey design

The survey was intended to capture as broad and complete a view of data production activities and curation concerns on campus as possible, at the expense of gaining more in-depth knowledge. Thus the survey asked only five questions, answerable in five minutes. Each question was multiple choice and multiple answer, and also allowed an open-ended response. Four of the questions, and their answer selections, were specifically chosen to characterize the features of a future curation unit and to discriminate between different potential development paths for that unit. The remaining question gathered rudimentary demographic data. Other questions that might have provided interesting data but no clear direction for future development ("What file formats do you use?", "How much data do you generate?", etc.) were omitted for the sake of brevity.

The complete survey instrument is available as a PDF.

The questions were:

  1. In the course of your research or teaching, do you produce digital data that merits curation?

    The intent of this question was to gauge the size of the data curation problem on campus. There are many ways that size could potentially be measured: by amount of data; by numbers of distinct datasets or data objects; by number of file formats in use; by numbers of research projects or funding grants; and so forth. For this survey we chose to measure size by number of researchers affected (and, indirectly, number of departments affected) because, for many curation services, the major cost of the service is directly correlated with the number of users. Additionally, users represent the interface points for service outreach and use.

    This question was yes/no, and a "no" response precluded responses to subsequent questions, including the demographics question. As a result of this survey logic we have no data on the demographics of the researchers for whom data curation does not apply; however, it is unlikely that respondents would have continued to fill out a survey they already considered inapplicable.

  2. Which parties do you believe have primary responsibility for the curation of your data, if any?

    The broad societal questions of who is responsible for data curation, who pays for it, and who performs the actual work are largely unresolved at this time. The intent of this question was to gauge who researchers believe should be responsible. The word "responsible" was chosen deliberately: our intent was to focus not on who is or is not handling data curation at present, nor on who should be doing the work of curation, but rather on who is ultimately responsible for ensuring that curation happens.

    Though the use of the adjective "primary" in the question wording might seem at odds with a multiple-answer question, many researchers did in fact select more than one answer.

  3. Are you mandated to provide for (or otherwise participate in) the curation of your data, and if so, by which agencies?

    Mandates are a relatively new phenomenon; for example, the National Science Foundation's requirement for data management plans dates only to 2011. But mandates are key and growing motivators, and this question helps us understand the extent to which they will play a role in the future.

  4. What data management activities could you use help with, if any?

    Each of the answer choices offered by the survey corresponds to a distinct activity that a future curation unit might undertake.

  5. With which departments, programs, and ORUs are you affiliated?

    There are several ways that respondents could be characterized for demographic purposes: by discipline, by data type, by funding source, etc. We decided that departmental affiliation would be the easiest answer for respondents to provide that would yield useful data.

    This question was intentionally placed last on the survey. Because it should be trivial for any researcher to list his or her departmental affiliations, the question also serves as a marker of survey completion.

Following the above questions was a final opportunity to provide any additional comments (and many respondents did so).

Implementation

The survey was implemented online using SurveyMonkey. It was anonymous, to allay any concerns over personal identification and thereby encourage participation. Web browser cookies were employed only to ensure that at most one response was received per survey recipient (technically, per web browser).

The survey was a blanket survey: the target population was all UCSB faculty and researchers. Identifying and contacting this population was actually somewhat difficult, as the University keeps no master record of researchers, nor is there a uniform mechanism for contacting them. Campus-wide mailing lists were eschewed as being too broad in scope (such mailing lists would also reach administrative staff, for example) and too duplicative (faculty typically receive the same campus-wide announcements via every departmental affiliation). To minimize duplicate emails, the survey announcement was sent through two distribution channels: the Academic Senate, which maintains a direct mailing list of all tenure-track faculty; and the Office of Research, which maintains a list of all ORUs (organized research units) on campus. In the latter case the Office of Research forwarded the survey announcement to the ORUs with a request to use whatever internal mechanisms they have available for contacting their respective researcher pools. This approach still resulted in some duplicate emails, though we believe the problem was minimized to the extent practicable.

Faculty and researchers were contacted via an initial email message. Three weeks later a reminder email message was sent. Another week after that, subject librarians within the Library performed targeted outreach to their respective departments.

The raw data was manually examined and refined before being subjected to statistical analyses. In some cases answers were changed when the intention was obvious (e.g., a respondent who manually entered "Computer Science" as a departmental affiliation, but failed to check the "Computer Science" box).
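
To illustrate the kind of refinement involved, the following is a minimal sketch in Python. It assumes a hypothetical CSV export with one column per checkbox answer plus a free-text "other" column; the file and column names here are invented, and the actual refinement was performed by hand.

    # Minimal sketch of the kind of refinement described above, assuming a
    # hypothetical file layout: one column per checkbox answer, plus a
    # free-text "other" column.
    import csv

    CHECKBOX_DEPARTMENTS = {"Computer Science", "Geography", "Physics"}

    def refine(row):
        """Promote an obvious free-text affiliation to its checkbox."""
        other = row.get("affiliation_other", "").strip()
        if other in CHECKBOX_DEPARTMENTS:
            row[other] = "1"            # check the box the respondent meant
            row["affiliation_other"] = ""
        return row

    with open("responses_raw.csv", newline="") as src, \
         open("responses_refined.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(refine(row))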

Availability of the raw survey data is subject to the approval of the UCSB Human Subjects Committee.

Plots and analysis

A few notes on interpreting the plots below:

The first survey question, which asked if the subject of data curation (and the survey itself) is applicable to the respondent, was yes/no and required. Only if the respondent answered "yes" could he or she continue through the remainder of the survey. As a consequence, for the first question only, percentages are relative to all responses received; for all subsequent questions, percentages are relative to the number of "yes" responses to that first question of applicability.

Because questions were multiple choice and multiple answer, percentages may (and generally do) sum to more than 100%.

In several bar charts below, multiple solid bars are contained within a larger, hollow bar. In such cases the hollow bar represents the union of the responses represented by the solid bars. Both the solid and hollow bars are plotted against the same vertical axis. Again, because respondents were free to select multiple choices, a hollow bar's constituent percentages may sum to more than the hollow bar's overall percentage.
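
As a concrete illustration of these conventions, the following Python sketch (with invented respondents and answer choices) computes per-choice percentages against the number of "yes" respondents, and computes the union percentage that a hollow bar would represent:

    # Sketch of the percentage conventions used in the plots (data invented).
    # Each "yes" respondent may select any number of choices, so per-choice
    # percentages can sum past 100%, and a hollow "union" bar can be smaller
    # than the sum of its constituent solid bars.

    # respondent id -> set of choices selected (multiple answers allowed)
    responses = {
        1: {"campus library", "funding agency"},
        2: {"campus library"},
        3: {"funding agency", "discipline repository"},
        4: set(),   # answered "yes" but selected nothing for this question
    }
    n_yes = len(responses)  # denominator: all "yes" respondents

    choices = ["campus library", "funding agency", "discipline repository"]
    for c in choices:
        pct = 100 * sum(c in sel for sel in responses.values()) / n_yes
        print(f"{c}: {pct:.0f}%")   # solid bars; these may sum past 100%

    # Hollow bar: respondents who selected ANY of the constituent choices.
    union_pct = 100 * sum(bool(sel & set(choices))
                          for sel in responses.values()) / n_yes
    print(f"union: {union_pct:.0f}%")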

Curation applicability

Responsibility

Help needed

Mandates

Demographics

Survey comments

At the end of the survey a space was provided where respondents could enter any additional comments. A sampling of these comments has been loosely categorized and included below.

Appendix: survey meta-analysis

References

  1. J. Scott Armstrong and Terry S. Overton (1977). Estimating Nonresponse Bias in Mail Surveys. Journal of Marketing Research 14: 396-402.
  2. Claudia Engelhardt, Stefan Strathmann, and Katie McCadden (2012). Report and analysis of the survey of Training Needs. Digital Curator Vocational Education Europe (DigCurV) project, April 2012.
  3. Robert M. Groves (2006). Nonresponse Rates and Nonresponse Bias in Household Surveys. Public Opinion Quarterly 70(5): 646-675. doi:10.1093/poq/nfl033
  4. Kathryn Lage, Barbara Losoff, and Jack Maness (2011). Receptivity to Library Involvement in Scientific Data Curation: A Case Study at the University of Colorado Boulder. portal: Libraries and the Academy 11(4): 915-937. doi:10.1353/pla.2011.0049
  5. Kristen Olson (2006). Survey Participation, Nonresponse Bias, Measurement Error Bias, and Total Bias. Public Opinion Quarterly 70(5): 737-758. doi:10.1093/poq/nfl038
  6. Bing Pan (2010). Online Travel Surveys and Response Patterns. Journal of Travel Research 49(1): 121-135. doi:10.1177/0047287509336467
  7. Jeanine Marie Scaramozzino, Marisa L. Ramírez, and Karen J. McGaughey (2012). A Study of Faculty Data Curation Behaviors and Attitudes at a Teaching-Centered University. College & Research Libraries 73(4): 349-365.
  8. Susan Wells Parham, Jon Bodnar, and Sara Fuchs (2012). Supporting tomorrow's research: Assessing faculty data curation needs at Georgia Tech. College & Research Libraries News 73(1) (January 2012): 10-13.
  9. James Wilson (2013). University of Oxford Research Data Management Survey 2012: The Results. Data Management Rollout in Oxford (DaMaRO) blog, 2013-01-03.

Notes

  1. Determining the minimum number of departments required to account for a given percentage of individual responses is a kind of "set cover" problem that is computationally difficult to solve exactly. The plot depicts a particular "greedy" strategy for covering responses, which yields a solution that may be only approximately optimal. However, the correctness of two of the plot's three main assertions (that 3 and 8 departments are required to cover 25% and 50% of responses, respectively) was verified by exhaustive search; see the sketch following this note.
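
For illustration only, the following Python sketch implements both the greedy strategy and the exhaustive verification. The department and respondent data are invented; each respondent may belong to several departments, which is what makes this a set cover problem.

    # Sketch of the greedy covering strategy and the exhaustive verification
    # described in note 1 (department/respondent data invented).
    from itertools import combinations

    # department -> set of respondent ids affiliated with it
    depts = {
        "EEMB": {1, 2, 3}, "Geography": {3, 4}, "Bren": {4, 5, 6},
        "Physics": {6, 7}, "CS": {8},
    }
    all_respondents = set().union(*depts.values())

    def greedy_cover(target_fraction):
        """Repeatedly pick the department covering the most uncovered
        respondents until target_fraction of respondents are covered."""
        covered, chosen = set(), []
        while len(covered) < target_fraction * len(all_respondents):
            best = max(depts, key=lambda d: len(depts[d] - covered))
            chosen.append(best)
            covered |= depts[best]
        return chosen

    def minimum_cover_size(target_fraction):
        """Exhaustive search: smallest number of departments whose union
        reaches the target fraction (exponential, fine at survey scale)."""
        need = target_fraction * len(all_respondents)
        for k in range(1, len(depts) + 1):
            for combo in combinations(depts, k):
                if len(set().union(*(depts[d] for d in combo))) >= need:
                    return k

    print(greedy_cover(0.5), minimum_cover_size(0.5))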
