The goal of The Human Project (THP) is to make it possible to do exploratory studies of human health and behavior on a large scale. This is different from the hypothesis-driven studies that have previously been more typical of neuroscience and psychology. However, exploratory studies have been incredibly useful in epidemiology and economics for identifying associations that had not previously been observed in smaller, more focused studies. Our choice of measurements and tools is guided by a consensus we are building among leading scholars regarding the factors they think will provide the greatest novel insight. Many of these are things that have only recently become easy to gather in an automated way, for example detailed financial transactions or electronic communication partners. Others are canonical measurements delivered in a new format that we hope to be able to use to connect back to the established literature. Taken together, these will provide a comprehensive view of our participants that can support a host of new discoveries.
The National Institutes of Health recently commissioned a national survey on public attitudes about the Precision Medicine Initiative (PMI), an effort to create a cohort of one million individuals who would provide genetic, medical and other forms of personal data as part of a national initiative to advance personalized medicine. There are a number of similarities between the PMI and the THP, chief among them the collection of sensitive personal information to build a resource for the use of scholars and clinicians. The results of this opinion survey suggest that nearly 80% of a diverse group of Americans think that such a study should be done, and that more than half would be willing to participate in it if asked. Of note, willingness to participate varied little by household income or race. However, younger people were a bit more willing to participate than older people, as were those with more years of education. The opinion survey also suggested what kinds of incentives would be appealing to potential participants (money and health information were the top two). It found that 75% of people would be willing to respond to a question or text sent to their cell phone by the study at least once a week, and that nearly half would be willing to respond even more frequently. And yes, maybe we're a little bit crazy, but the biggest scientific advancements come from once-thought "crazy" ideas!
A critical part of The HUMAN Project is integrating environmental data into our detailed data about individuals. The Center for Urban Science and Progress (CUSP) at NYU already maintains unique access to New York City’s data flows through specific agreements with the municipality. Our goal is to capture relevant aspects of that data flow, expand it to the national and international level, and embed it into the database in a way that makes examining the relationship between behavior and environment powerful and simple to accomplish. Of course, once we have built the platform for data collection and analysis for the study of New York City, we expect to be able to easily implement it in other cities that are working on urban data collections (e.g. Chicago and London) but are not currently quite as far along as CUSP.
We are aiming to enroll the first participant in mid-2017, and we hope to complete enrollment of all 10,000 participants over the course of three years.
Everything! Okay, that’s impossible, but our goal is to gather data that provides a comprehensive view of the lives of our participants, from their genomes and microbiomes, to their physical health, psychological traits and states, education, employment, financial status, detailed purchasing behavior, social networks, and physical location at a relatively fine granularity. The detailed list of measurements is currently being finalized, but it will contain a wide variety of data types that cut across traditional scholarly boundaries, ranging from medicine, neuroscience and psychology to economics and sociology. For each participant, we will perform full genome sequencing, microbiome sequencing, a basic physical exam, a psychological assessment, and social network and communication pattern profiling. We will also take full histories of health, education, employment, and finances, and then continue to collect these data going forward. Finally, we will gather location data at two-minute intervals and do a socio-political assessment (a broad category encompassing voting records and religious and philanthropic activity).
Some kinds of missing data are easier to track than others. For example, we will monitor the incoming data stream from participant mobile phones to look for disruptions. Study staff will follow up to determine the source of the disruption, and then, whether it’s a dead phone battery, a lost phone, or a vacation out of the country, they will contact participants to address the problem. However, there will also be missing data that will be more difficult to track and remediate. For example, a participant might choose to carry a second mobile phone to keep certain communications private. In these cases, we may not be able to do anything about it.
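As a rough illustration of the first case, a disruption in a participant's incoming data stream could be detected by looking for unusually long gaps between consecutive samples. This is only a sketch, not our actual monitoring system: the function name `find_gaps`, the tolerance factor, and the sample timestamps are hypothetical, and the two-minute default is borrowed from the planned location sampling interval.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval=timedelta(minutes=2), tolerance=3):
    """Return (last_seen, next_seen) pairs where consecutive samples are
    further apart than tolerance * expected_interval."""
    threshold = expected_interval * tolerance
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return gaps


# Hypothetical stream: five samples two minutes apart, then silence
# until 10:00, then three more samples.
start = datetime(2017, 6, 1, 9, 0)
samples = [start + timedelta(minutes=2 * i) for i in range(5)]
samples += [start + timedelta(minutes=60 + 2 * i) for i in range(3)]

gaps = find_gaps(samples)
for last_seen, next_seen in gaps:
    print(f"No data between {last_seen:%H:%M} and {next_seen:%H:%M}")
```

A check like this can only flag that data stopped arriving. It cannot say why, which is where follow-up by study staff comes in, and it does nothing for the second case, where data from a second phone never reaches us at all.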
As we design the study, we are conscious of the fact that nearly all the technology that we use today will be replaced by improved tools that we hope to be able to implement to produce better quality data over time. To do this, we focus on what we plan to measure, rather than the tools or methods used to measure it. For example, we talk about gathering location data – today that means using technologies like GPS, wireless or Bluetooth. In the future, we still expect to be gathering location data, but using tools that are cheaper, simpler and more reliable. The key to combining these data sets over time will be to ensure that the accuracy and precision of all measurements are well documented.
By IRB regulations, children are a protected class, so they cannot be involved in any data collection that presents any risk, such as blood collection, since the results will provide no direct benefit to the children in the study. In addition, it would not be developmentally appropriate for children under ten to carry cell phones, so certain forms of data collection that depend on cell phones, such as location tracking and communication partners, will not be possible with the youngest children in the study. However, we will use wearable Bluetooth beacons to look at interactions between children and adults in the study, and we will still gather information about the home environment in which children reside, along with all of their education data. This will result in a very rich set of data for studying child developmental outcomes.
No. Compliance for diaries is generally low, and studies of food diaries, in particular, suggest that this class of self-reports is relatively unreliable. We are hoping to gather as much data as possible using automated methods. For example, we plan to get permission from participants to acquire medical, administrative and financial records directly from the source. With detailed financial purchasing records, we will be able to determine how much people spend on groceries, as compared to restaurants or fast food. For some kinds of data for which there aren’t really records, such as social networks and life stress levels, we will contact participants directly. However, we will have a strict time budget limiting the number and extent of these interactions, and they will be gamified to encourage participation.
Selected households will be recruited using standard social science methods, in which incentives are provided for gradual steps towards increasing the individual’s level of participation in the study (e.g. the Dillman Method, Dillman, 1978). We will start by sending promotional material to the homes of potential participants to introduce them to The HUMAN Project. We will then follow up with contact in person, at which time we will offer people an incentive just to listen to a short (~15 minute) pitch about the project, and fill out a preliminary survey, without any request to actually participate in the THP. Finally, we will offer an additional incentive to encourage prospective participants to listen to an extended presentation inviting them to join the study. If, at any of the preliminary stages, we do not get a response from a prospective household, we will re-contact them a few weeks later to try to move forward with the recruitment process, another strategy that has been shown to increase uptake. With these approaches, we hope to get a large number of the selected households to agree to participate in the THP.
Households will be randomly selected using an address-based, multi-stage area probability sample design. Addresses will come from the US Postal Service Computerized Delivery Sequence File, in which there is an entry for every address in New York City that receives mail, and this will be used in conjunction with PLUTO, the New York City tax lot database. Additional data sources will be used to supplement this file. Given that households will be selected from this list, or study frame, using a specific statistical method, the challenge will be to ensure that a large proportion of the households we select are ultimately willing to enroll in the study. This is needed to avoid selection bias resulting from differences between households that choose to participate and those that do not.
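To give a feel for what a multi-stage design means, here is a heavily simplified two-stage sketch: first draw a random set of areas, then draw a random set of addresses within each chosen area. Everything in it is an assumption for illustration, including the function name `two_stage_sample`, the made-up address frame, and the use of equal-probability draws at both stages. Real area probability designs typically involve more stages and unequal selection probabilities, and this is not the project's actual sampling procedure.

```python
import random


def two_stage_sample(addresses_by_area, n_areas, n_per_area, seed=0):
    """Stage 1: sample areas. Stage 2: sample addresses within each area.

    Both stages use simple random sampling without replacement
    (an illustrative simplification).
    """
    rng = random.Random(seed)
    areas = rng.sample(sorted(addresses_by_area), n_areas)
    selected = []
    for area in areas:
        pool = addresses_by_area[area]
        k = min(n_per_area, len(pool))  # small areas contribute all they have
        selected.extend((area, address) for address in rng.sample(pool, k))
    return selected


# Hypothetical frame: 5 areas with 10 mail-receiving addresses each.
frame = {
    f"area_{i}": [f"address_{i}_{j}" for j in range(10)]
    for i in range(5)
}

picked = two_stage_sample(frame, n_areas=2, n_per_area=3, seed=1)
for area, address in picked:
    print(area, address)
```

Fixing the seed makes the draw reproducible, which matters when a sample has to be audited or extended later.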
We can only estimate the attrition rate for the study at this point, as no previous study has employed our planned technology to both engage and track participants. Large-scale studies such as the Health and Retirement Study (HRS) and the Framingham Heart Study have relatively high loss of participants after the first interaction, but tend to be quite stable thereafter, as people begin to identify with the study group and become personally invested in contributing to the research endeavor. We expect a similar pattern of participation for the THP. In many longitudinal surveys, a main source of attrition is people who move without providing a forwarding address. We expect that this will be less of an issue for us, as we will constantly be in contact with our participants via their cell phones and the location data they provide, so we can reach out directly to participants to update our records whenever necessary. Still, some people will ultimately leave New York City. The outmigration rate in NYC is surprisingly low: although thousands of people move out of NYC every year, they make up only about 5% of the total city population. As a result, we expect that about that percentage of our study population would move out of the city each year. Even these levels of outmigration are not necessarily detrimental to the survey, particularly as so much of the data collection will be automated. We expect that we will still be able to follow those who move to the nearby suburbs or leave the area temporarily (such as elders traveling to warmer climates for the winter or students traveling to college), as we can make face-to-face contact with these participants without too much difficulty. Even those who permanently leave the New York City area could be at least partially retained, as long as they allow us to continue all passive and non-physical data collection. Only those who migrate out of the country will be completely lost from the study.
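To put the 5% figure in perspective, the arithmetic below shows how many of 10,000 participants would still live in the city after each year if outmigration were the only source of change. It is a back-of-the-envelope sketch under loud assumptions: a constant 5% annual rate, no other attrition, and no replacement of participants. The function name `expected_remaining` is hypothetical. Because many movers can still be followed, this describes how many people would move away, not how many would be lost to the study.

```python
def expected_remaining(initial, annual_loss_rate, years):
    """Participants still in the city after `years`, assuming a constant
    annual outmigration rate and no replacement (illustrative only)."""
    return initial * (1 - annual_loss_rate) ** years


for year in range(6):
    remaining = expected_remaining(10_000, 0.05, year)
    print(f"Year {year}: about {remaining:,.0f} participants still in NYC")
```

Under these assumptions, 9,500 participants would remain in the city after one year and 9,025 after two.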
It depends on why they leave the study. People who resign from the study, leave the country, or pass away will be replaced by new participants selected based on our dynamic model of New York City. However, we expect to be able to retain people who move out of New York City but remain in the New York metropolitan area or return periodically (for example, college students or retirees who winter in warmer climates). For those who move permanently beyond the New York area, but stay in the United States, we may also be able to continue to gather most kinds of data, though there may be some limits to the amount of face-to-face contact and technical support that will be economically feasible.
We envision providing three kinds of incentives. We will likely give some amount of cash in exchange for time spent participating in THP activities, but finding appropriate levels of cash incentives can be problematic. For our lowest income participants, it might be coercive or compromise their eligibility for social programs; for our highest income participants, the amount of cash that would be motivating may be more than a reasonable budget can bear. Therefore, we also plan to incentivize people to participate in the THP by creating a sense of community and personal inclusion. Sending holiday cards, birthday cards, newsletters and updates about the findings of the project is a strategy successfully used by many longitudinal studies, from the Framingham Heart Study to the Health and Retirement Study. Finally, we hope to use as an incentive the one nearly unlimited asset that we have – data! We plan to provide interactive visualization tools that allow project participants to play with their own data and aggregated data for the whole study population, and to have a data analyst who uses these tools to build interesting visualizations that any participant can view. We are conscious of the fact that allowing participants to see their own data could induce substantial treatment effects, so we plan to segment the feedback that participants receive such that the other participants serve as a control for any bias that may be induced by receiving feedback.
Right now, only members of the households that are selected can participate in the study, and these individuals will need to provide all the forms of data we collect if they want to participate in the study. Partial data is problematic because the places where data is missing are likely to be biased, and this makes it nearly impossible to conduct meaningful analyses with the data. For example, if people who live in the Bronx were less likely to provide data about cognitive performance, we would be unable to study the effect of environmental variables on cognition across all of New York City. Or, if increasing age decreased the likelihood of being willing to provide information about social networks, it would be difficult to study longitudinal effects on social interactions. We need to ensure that we collect unbiased data, and that means that members of the selected households who participate in the study will not be able to pick and choose which of the requested sources of information they provide.
Privacy and Security
It will be critical to participants that we protect their data from third-party requests. Studies that collect sensitive data (for example, research on drug abuse or any health-related topic) typically use a Certificate of Confidentiality (CoC) granted by the NIH that allows researchers to refuse to disclose information in response to legal demands. The THP will apply for a CoC, which we expect to be eligible for and which we expect will protect the data from most third-party requests. Moreover, most of the data collected by the THP is directly available from the original source (for example, cell phone location data is easily subpoenaed from wireless providers, who are not protected by research regulations, and such requests are made and fulfilled on a regular basis). Thus, there is little incentive for lawyers to engage in a long process of obtaining indirectly collected records from us.
All personally identifying information will be stored separately from other, less secure data to prevent direct re-identification of participants. Further, scholars will only be provided with access to the data required for their particular studies, limiting the possibilities for recombination. However, given the detailed level of data collection, it will be impossible to fully anonymize participants – for example, recent work from de Montjoye and colleagues (Science. 2015 Jan 30;347(6221):536-9) showed that location data and purchasing data were sufficient to re-identify individuals in a completely anonymized database. Thus, we will rely heavily on the ethical requirements we place on scholars accessing the data – that they must not, under any conditions, intentionally or unintentionally, re-identify participants during the course of data analysis.
It will be standard procedure to provide some sort of intervention for any life-endangering finding. For example, if the medical personnel performing the physical exam at intake measure an unsafe blood-pressure level, they will be required to call an ambulance or provide a referral to a doctor, depending on the severity of the finding. However, for other kinds of data, identifying circumstances that require intervention will be complicated by the fact that much of the data will not be examined in real time. With such a large participant population, it would be too expensive for a staff member to check all results, so we will need to create automated systems for identifying test scores that represent situations where intervention may be necessary. This will be relatively straightforward for medical or cognitive testing with broad norms and accepted cut-offs for dangerous situations.
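For measurements with accepted cut-offs, the kind of automated check described above amounts to comparing each result against a table of thresholds. The sketch below is purely illustrative: the `Cutoff` class, the `triage` function, and especially the numeric thresholds are placeholders we made up for the example. They are not clinical guidance and not the project's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoff:
    """Thresholds for one measure (placeholder values, not clinical advice)."""
    refer_above: float   # above this, suggest a referral to a doctor
    urgent_above: float  # above this, escalate immediately


def triage(measure, value, cutoffs):
    """Return 'urgent', 'referral', or 'none' for a single test result."""
    cutoff = cutoffs[measure]
    if value > cutoff.urgent_above:
        return "urgent"
    if value > cutoff.refer_above:
        return "referral"
    return "none"


# Hypothetical threshold table; real values would come from clinical norms.
CUTOFFS = {
    "systolic_bp_mmHg": Cutoff(refer_above=140, urgent_above=180),
}

for reading in (118, 150, 195):
    print(reading, triage("systolic_bp_mmHg", reading, CUTOFFS))
```

Keeping the thresholds in a table separate from the logic means that norms can be reviewed and updated by clinicians without touching the code.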
Several features of the data facility will enable researchers to access the data while preserving security. Data will go through a staging process as it is ingested into the facility for storage. Then, when researchers want to run queries or perform analyses, temporary specialized “data marts” are created which contain only the relevant data sets. Data can only move in one direction through the data facility, so that unauthorized access to the facility cannot be achieved from the staging server or the data marts; this is controlled by both electronic and physical means. Three features of the data marts enhance the security of the data. Each data mart is created for a specific project, and researchers have access to only the data mart(s) required for their own project for the amount of time required to perform the necessary analyses. After the researcher is finished, the data mart is removed by deletion (though the data itself always remains safely stored in the facility in its original form), so that it is not vulnerable to unauthorized access. The data marts themselves are thus heavily partitioned project silos. Other security measures include integrity checking of the data that comes off participants’ phones, computers and outside servers, detailed network segmentation, login and behavioral monitoring of activity on the network and stringent access controls (both physical and electronic).
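The data mart idea can be sketched in a few lines: a mart holds a copy of only the data sets one project needs, can be read only by that project and only until it expires, and can be deleted without affecting the main store. This is a toy model to make the concept concrete, not a description of the real facility. The `DataMart` class, its methods, the project name, and the sample records are all hypothetical.

```python
import copy
from datetime import datetime, timedelta


class DataMart:
    """A temporary, project-specific copy of selected data sets."""

    def __init__(self, project, store, dataset_names, expires_at):
        self.project = project
        self.expires_at = expires_at
        # Copy only the requested data sets. The mart keeps no reference
        # back into the main store, so data flows in one direction only.
        self._data = {name: copy.deepcopy(store[name]) for name in dataset_names}

    def read(self, project, name, now):
        if project != self.project:
            raise PermissionError("this data mart belongs to another project")
        if self._data is None or now >= self.expires_at:
            raise PermissionError("this data mart has expired or been deleted")
        return self._data[name]  # KeyError if the data set was never included

    def destroy(self):
        """Delete the mart's copy; the main store is untouched."""
        self._data = None


# Hypothetical main store with three data sets.
store = {
    "location": [("p1", 40.73, -73.99)],
    "purchases": [("p1", "groceries", 42.50)],
    "genome": [("p1", "sequence-placeholder")],
}

created = datetime(2017, 6, 1)
mart = DataMart("commute-study", store, ["location"], created + timedelta(days=30))
rows = mart.read("commute-study", "location", now=created + timedelta(days=1))
print(rows)

mart.destroy()  # the analysis is finished; the original data remains in `store`
```

In this sketch, a request for a data set that was never copied into the mart, such as `"genome"`, fails even for the owning project, which mirrors the point that researchers see only the data required for their own study.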
Even with the most stringent security policies in place, phishing tactics have gotten very sophisticated, so human error leading to stolen credentials is a credible possibility (as seen in the Sony security breach). However, stolen credentials need not lead directly to a security breach of our systems, since two-factor electronic identification will be required to access the data facility. Even if password information is stolen, it will be useless without the second component required for logging in, for example, a USB stick with a secret token or a biometric.
We take the consent process very seriously, and have designed a three-step consent process to ensure that consent is truly informed. First, a video will be presented which describes the basic issues that require consent, as well as insights from secondary use cases for the data. The video will be divided into segments, and after each one a member of the consent staff will stop the presentation to administer comprehension checks and facilitate a discussion about the issues raised in the video. After the video, participants will be presented with the opportunity to provide oral consent. If they do so, the third step will be the presentation of the long paper form for participants to sign. This paper form will contain all of the legal language necessary for written consent, but all the information in the form will have been covered during the video and discussion process. Thus, we expect that the acquisition of written consent will be relatively straightforward and a somewhat incremental step of the consent process.
For reasons of security and quality control, the data won’t be accessible in real time, but we expect that it will be available to researchers soon after collection.