Abstract
Big data (BD) is of high interest for research and practice purposes because it has the potential to provide insights into the population served and healthcare practices. Much progress has been made in collecting BD and creating tools for big data analytics (BDA). However, healthcare organizations continue to experience challenges associated with BD characteristics and BDA tools. Utilization of BD impacts current decision-making, planning, and future use of artificial intelligence (AI) tools, which are trained on BD. This qualitative study focused on better understanding the reality of BD and BDA management and usage by healthcare organizations. Six structured interviews were conducted with individuals who work with healthcare BD and BDA. Findings confirmed the known challenges associated with BD/BDA and added rich insights into the structural, operational and utilization aspects, as well as future directions. Such perspectives are valuable for education and improvements in BD/BDA management and development.
Keywords: big data, big data analytics, health records, digital data, population health, artificial intelligence
Introduction
The implementation of electronic health records (EHRs) and widespread information systems and applications for providers, consumers, and other parties have led to tremendous growth of electronic health data. The current sources of data include mostly textual content, which can be structured, semi-structured or unstructured. They also include videos, audios, and images that constitute multimedia. They can come from a variety of platforms such as machine-to-machine communications, social media sites, sensor networks, cyber-physical systems, and Internet of Things (IoT).1 These platforms begin to define big data (BD) because they make us think about size, volume, complexity, and heterogeneity of the data emanating every second from a variety of devices.
BD arrived sooner than the development of appropriate and efficient analytical methods for its analysis. In addition to the structured data, BD includes massive volumes of heterogeneous data in unstructured text, audio, video, and other formats, and so is not amenable to the inferences of statistical methods that are used for analyzing numerical structured data. Unstructured BD requires new tools for predictive analytics. In addition, there is a need for computationally efficient algorithms to handle the heterogeneity, noise, and massive size of structured BD. These are ways to dispel and/or avoid potential spurious correlations.
Artificial intelligence (AI) and data analytics are top technology priorities as they capitalize on sustainability through data analytics and adaptive AI.2 For over a decade, Mayer-Schönberger and Cukier have encouraged datafication of BD, whereby virtually anything is transformed into useful data (insights) by documenting, measuring, and capturing it digitally.3 Van Dijck asserted that the future of BD and big data analytics (BDA) will lie with machines, where data will be generated, shared, and communicated among data networks.4 After a decade of progress, much of the structured and unstructured data stored in EHRs can be analyzed with the use of natural language processing (NLP) and machine language processing (MLP) algorithms, which can unlock the value of the text and galvanize the extraction of the hidden insights and connectors.1 Transforming unstructured text into real patient insights holds great potential for improving health outcomes. Use of AI and BDA for clinical and non-clinical applications in healthcare has great potential; however, the majority of healthcare organizations have yet to realize the full benefits of their BD. This highlights the need to better understand the status quo of how big data is being handled and analyzed by healthcare organizations. What are some of the ways big data is being used, and what are some of the challenges faced by healthcare organizations when it comes to working with big data? A deeper dive into how organizations use big data, how much they invest in big data technologies, and what challenges they experience creates an opportunity to identify and share some best practices, as well as identify potential gaps. Where the findings are translated into real patient insights and where such knowledge fosters better health outcomes, there may be opportunities for positive change in terms of improving population health, addressing health inequalities, improving operations, and reducing healthcare costs.
Background and Significance
Big Data
BD refers to data sets that are so large or complex, with high volume, high velocity, and high variety, that they cannot be processed by traditional data processing software in a reasonable amount of time, thus requiring advanced techniques and technologies for management and analytics.5,6,7,8 BD can be described by characteristics such as volume, variety, velocity, variability, veracity, and value.
BD is inherently defined by big volume.9 The quantity of generated and stored data is usually reported in multiple terabytes and petabytes – where a terabyte stores enough data to fit on 1500 CDs or 220 DVDs. A terabyte of data would store approximately 16 million Facebook photographs. The volume of data in healthcare continues to grow because information is increasingly gathered not only systematically in systems used by hospitals, pharmacies, laboratories, insurance, research institutions, or genetic databases, but also by numerous information sensing IoT devices used by providers, patients, and other parties. The size of the data is believed to account for its value as well as its potential insight. Volume-related challenges are related to storage and data management technologies.
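As a rough sanity check on the storage comparisons above, the arithmetic can be sketched as follows. The media capacities (700 MB per CD, 4.7 GB per single-layer DVD) and the decimal definition of a terabyte are assumptions made for illustration and are not stated in the text.

```python
# Decimal units: 1 TB = 10**12 bytes (an assumption; binary units differ by roughly 10%).
TERABYTE = 10**12

CD_BYTES = 700 * 10**6    # assumed capacity of one CD (700 MB)
DVD_BYTES = 4.7 * 10**9   # assumed capacity of one single-layer DVD (4.7 GB)

cds_per_tb = TERABYTE / CD_BYTES
dvds_per_tb = TERABYTE / DVD_BYTES

# Average photo size implied if 16 million photographs fill one terabyte.
bytes_per_photo = TERABYTE / 16_000_000

print(f"CDs per TB:  {cds_per_tb:.0f}")
print(f"DVDs per TB: {dvds_per_tb:.0f}")
print(f"Implied photo size: {bytes_per_photo / 1000:.1f} KB")
```

Under these assumptions the results land close to the figures quoted above (about 1,500 CDs and 220 DVDs), and the photograph count implies an average image size of roughly 60 KB.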
The type and nature or the structural heterogeneity of the data describes its variety.9 Structured data, mostly tabular data, found in spreadsheets and relational databases constitute about 20 percent of healthcare data.10 Unstructured data includes mostly text, images, audios, and videos. Semi-structured data may or may not conform to strict standards and include textual language for Web data exchange, called Extensible Markup Language (XML), that deploys user-defined data tags to make them machine readable. BD variety becomes even more complex given the diverse sources and formats, requiring that data from those sources be connected, matched, cleansed, and transformed.
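To illustrate what semi-structured data with user-defined tags can look like, the following minimal sketch parses a small XML fragment with Python's standard library. The record, tag names, and attribute names are hypothetical and do not follow any particular healthcare standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag and attribute names are user-defined, not a real standard.
record_xml = """
<patient id="p-001">
  <name>Jane Doe</name>
  <observation code="heart-rate" unit="bpm">72</observation>
  <note>Patient reports mild fatigue after exercise.</note>
</patient>
""".strip()

root = ET.fromstring(record_xml)

# Tagged fields can be pulled out as structured values...
structured = {
    "patient_id": root.get("id"),
    "name": root.findtext("name"),
    "heart_rate_bpm": int(root.find("observation[@code='heart-rate']").text),
}

# ...while the free-text note stays unstructured and would need text analytics.
free_text = root.findtext("note")

print(structured)
print(free_text)
```

The tags make part of the record machine readable, while the note element shows how unstructured text can sit inside the same document.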
At the heart of big data is velocity, which measures the rate of data generation and the speed at which the data is analyzed and acted upon to meet the demands and challenges that lie in the path of growth and development of organizations.9 Smartphones, digital sensors, and other devices using mobile apps produce enormous amounts of useful information about customers (or patients), including geospatial location, demographics, buying and viewing patterns, and even physical activity or other health indicators tracked by mobile apps. These types of data can be analyzed in real time to harness real-time intelligence.
Another dimension of BD is variability, which implies the inconsistency or variation in the data flow (whereas velocity shows periodic peaks and troughs).9 Variability can hamper processes that manage BD.
Veracity reflects the “truthfulness” of data and was added as a BD characteristic by IBM, given their specialization in removing and replacing BD errors.11 Addressing imprecision and uncertainty becomes relevant for BD because of the inherent unreliability of certain data sources. The quality of captured data may vary tremendously, thus affecting the accuracy of the analysis and results.
Lastly, BD is generally associated with value, which means that when large volumes of BD are analyzed, it is possible to extract high value from them.8 The original form of data has low value, but the information identified through its analysis can make a difference in its value. For that to happen, data should be relevant and of high integrity.
Big Data Analytics
BDA involves the analysis of BD. It is during this process that the value of big data for decision support and business intelligence is realized. Given BD characteristics, BDA cannot be derived by simple statistical analysis.12,13 In fact, the use of advanced BDA tools and extremely efficient, scalable, and flexible technologies is necessary to efficiently manage and analyze the substantial amounts and variety of data.1,14 Technologies such as NoSQL Databases, BigQuery, MapReduce, Hadoop, WibiData, and Skytree have been in use for more than a decade.15 AI tools such as Microsoft Power BI, Microsoft Azure Machine Learning, QlikView, RapidMiner, Google Cloud AutoML, or IBM Watson Analytics are offering greater value in BDA. For example, Microsoft Power BI was successfully used to detect specific antenatal data for babies small for gestational age (SGA) and monitor them through a dashboard, thus allowing clinicians to intervene and plan delivery as necessary.16
BD management entails both the processes and the associated technologies that allow for the acquisition, storage, and retrieval of data, which can be done in three stages: acquisition/recording; extraction, cleaning, and annotation; and integration, aggregation, and representation.17,18 Analytics involves the techniques applied in analyzing and acquiring intelligence from BD and can be completed in two stages: modeling and analysis; and interpretation. It becomes imperative that processing and management be efficient enough to expose new knowledge in a timely manner, which is crucial for capitalizing on emerging opportunities, providing a competitive edge, and delivering rich business intelligence used to differentiate the organization and increase visibility, flexibility, and responsiveness to environmental changes.19,20,21,22,23 The allure of healthcare BDA is the ability to examine and apply the patterns that emerge from various and vast amounts of healthcare data to predict trends in population health and ways to improve it, while limiting costs. BDA benefits are already visible in reduced administrative costs, improved clinical decision support, better care coordination, reduced fraud and abuse, as well as improved patient wellness.24 Adoption of mHealth, eHealth, and wearable technologies will push the increase in BD volume. Increased integration of such data with EHRs, imaging, patient-generated data, or sensor data creates even greater opportunities to leverage BD in healthcare.
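The three management stages and two analytics stages described above can be sketched as a chain of small functions. This is a toy illustration only: the data, field names, and the review threshold are hypothetical, and a real BD pipeline would rely on distributed storage and processing tools rather than in-memory lists.

```python
from statistics import mean

def acquire():
    # Stage 1, acquisition/recording: hypothetical raw readings from two sources.
    return [
        {"source": "ehr", "patient": "A", "glucose": "110"},
        {"source": "wearable", "patient": "A", "glucose": " 126 "},
        {"source": "ehr", "patient": "B", "glucose": None},
        {"source": "wearable", "patient": "B", "glucose": "98"},
    ]

def clean(rows):
    # Stage 2, extraction/cleaning/annotation: drop missing values, convert text to numbers.
    return [{**r, "glucose": float(r["glucose"])} for r in rows if r["glucose"] is not None]

def integrate(rows):
    # Stage 3, integration/aggregation/representation: group readings by patient.
    merged = {}
    for r in rows:
        merged.setdefault(r["patient"], []).append(r["glucose"])
    return merged

def analyze(merged):
    # Stage 4, modeling and analysis: a simple per-patient average.
    return {patient: mean(values) for patient, values in merged.items()}

def interpret(averages, threshold=100.0):
    # Stage 5, interpretation: the threshold is illustrative, not a clinical cutoff.
    return {p: ("review" if avg > threshold else "ok") for p, avg in averages.items()}

result = interpret(analyze(integrate(clean(acquire()))))
print(result)
```

Keeping each stage as a separate function mirrors the staged description and makes it easier to see where data quality or integration problems would surface.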
Much of the BD and BDA research demonstrates success in the use of BD and BDA tools, such as in monitoring SGA babies, the response to COVID in Taiwan, or the use of BD in mental health care.16,25,26 One study also highlights issues with big data privacy.27 Other studies help in understanding BD and BDA concepts through reviews, analyses, and summaries.19,28,29 In our study, we focused on the healthcare organizational structure regarding big data, the approach in integrating big data into operations, issues and challenges experienced, and the vision for BDA. Our research question was “How are healthcare organizations handling BD and BDA?” Better understanding of this reality serves not only to share best practices or challenges but also to inform decisions on resource allocation and opportunities for education of professionals to work with BD and BDA.
Methodology
The purpose of this study was to gain a greater understanding of how BD and BDA are handled within healthcare organizations. To gain such perspective, the study evaluated the experiences of professionals with healthcare BD and BDA. For this applied research, we followed the case study method, a qualitative research design.30 Case studies help explore an activity or process in depth and allow for detailed data collection through interviews of one or more individuals.31,32 The research was approved by the Institutional Review Board at Walden University.
The sampling strategy was purposeful and convenience-based. The research team focused on identifying individuals from various settings who worked with BD and BDA. Based on professional connections and LinkedIn profiles, we reached out to nine individuals in such roles (not all at once); over time, only six of them were available to participate in the study. We conducted six structured interviews with individuals whose main work was managing and/or analyzing healthcare big data. The interviews were completed virtually via Zoom and lasted between 45 and 60 minutes each. The principal investigator conducted the structured interviews by following the pre-established interview protocol, which included an introduction to the study and researchers, verbal agreement to participate in the study, and questions in order, as presented below. Probes were also used at times to elaborate on some of the answers with further details and/or examples. The other two researchers were present during all interviews, recorded the sessions, and took notes. All interviewees were asked the following 11 standard open-ended questions:
- Can you please describe your role and how your organization’s big data team is structured for data collection and data analytics?
- What investments has your organization made to drive or support big data analytics?
- Can you briefly describe the types of questions your organization answers by using big data analytics?
- Can you briefly describe the types of decisions that are based on big data?
- What is your organization’s approach for integrating data analytics into operations?
- Sometimes a game changing opportunity arises, but the opportunity does not get vetted with evidence from the big data. Have you seen this happen in your organization? If so, can you give an example?
- How does your organization use big data to support population health?
- Now I’d like to focus on challenges in using big data. What are some of the frequent problems that big data analysts in your organization encounter?
- What are some solutions or approaches you have employed to overcome those challenges?
- Now, let’s talk about non-healthcare organizations that use healthcare big data.
  - What are your thoughts on how device manufacturers, pharma, and insurance companies benefit from healthcare big data?
  - What are your thoughts on how data companies such as Google, Amazon, and Microsoft benefit from healthcare big data?
- Finally, let’s talk about the future.
  - What are your thoughts on how your organization will use big data in the future?
  - Are there any new tools or resources your organization plans to use to improve the usage of big data and the experience with big data analytics?
  - Given sufficient resources, what is your vision for an effective and efficient data analytics program in your organization?
After each interview, researchers discussed the main points that came out during the interviews. After the sixth interview, it was determined that the saturation point was reached, and no further outreach was made for additional interviews.33
The transcribed interviews were analyzed by using a summative content analysis approach. The summative approach focuses on identifying the essential aspect of the text and has been used successfully in analyzing interviews from healthcare professionals to examine complex text from diverse sources, including innovation in services or technology, which is similar to our research.34 This approach is also accommodating to differences (as opposed to only similarities), which is important in our study, given the diverse roles of interviewees and their experiences with BD and BDA.
Responses were coded based on the topics addressed through questions. Codes were aggregated into concept maps to group related codes into themes and show relations. While the use of standardized open-ended questions facilitated the data organization and analysis, some portions of answers that were provided under a certain question were moved to areas where they fit the topics better. For example, responses to questions 1 through 6 were categorized into: interviewee roles; organizational structure for BD and BDA; purpose of using BD and BDA; and dynamics/processes of using BD and BDA. The rest of the themes, such as use of BD for population health, BD/BDA challenges, approaches in addressing such challenges, use of BD by non-healthcare organizations, and future directions, were consistent with the questions asked. Another important note is that, due to the diversity of the interviewees and the organizations they represented, response analyses are mostly broken down by the type of organization.
Responses were coded by two researchers independently and discussed. No discrepancies were found, and 100 percent consensus was reached among the research team. All researchers engaged in recording, transcribing, discussing the text, identifying themes and key points, counting and comparing keywords and/or content, as well as interpreting the underlying context. Results of the interviews are organized and presented below.
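The keyword counting step of a summative content analysis can be sketched in a few lines. The excerpts and the keyword list below are hypothetical placeholders written for illustration; they are not taken from the study transcripts, and the sketch is not the procedure the research team used.

```python
import re
from collections import Counter

# Hypothetical excerpts standing in for transcribed responses (not study data).
excerpts = [
    "Information is siloed and integration is hard.",
    "Data literacy is low; integration takes time.",
]

# Hypothetical code list agreed on by the coders.
keywords = {"siloed", "integration", "literacy"}

# Tokenize each excerpt into lowercase words and tally only the coded keywords.
counts = Counter(
    word
    for text in excerpts
    for word in re.findall(r"[a-z]+", text.lower())
    if word in keywords
)

print(counts.most_common())
```

Counts of this kind only support the analysis; interpreting the context around each keyword remains a manual step.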
Results
Six interviews were conducted with seven professionals who work with big data in different capacities and settings. To clarify the context of the results, where necessary, responses from interviewees that represented care provider organizations are discussed first, and responses from the quality management and the data platform representatives are summarized right after. Following are the findings from those interviews.
Interviewee Roles
Interviewee roles included the manager of healthcare data analytics at a large healthcare system in Pennsylvania, the chief research information officer at a university hospital in Ohio, the director of analytics and performance measurement along with a team member from a national quality organization in Virginia, a consultant and program manager at a private not-for-profit healthcare system in New Mexico, the senior director of engineering application at a large global data platform company in California, and the director of a data analytics consulting company in Missouri.
Organizational Structure for BD and BDA
Interviewees were asked about the formal organizational structure dedicated to working with BD, and they indicated that there is either a dedicated team/function, or department (such as a data analytics department) that is focused on working with health data. These teams were composed of business analysts, developers, data architects, engineers, clinicians, and occasionally health information specialists, and the size varied from a few to about 100 (the larger numbers correspond to larger health systems and the global data platform company). Additionally, staffing is done with internal employees and consultants. Consolidation of prior data analytics teams into one large function was mentioned by three of the interviewees. Despite the use of external resources, BD work is led and driven internally.
The way these teams function varies significantly, depending on the type and size of organization, as well as resources available. Two interviewees indicated that much of the BD work is conditioned by EPIC, the EHR used in the facility. In those cases, EPIC data and claims data are brought together into a common data governance platform. Physical servers are used, but cloud-based infrastructure is expanding.
How Are BD and BDA Used by Organizations – Purpose
Four interviewees shared that healthcare systems use BD and BDA to respond to regulatory requirements from the federal government, payers, or audit needs, as well as to fulfill executive and business unit requests. Requests mostly follow industry trends and benchmarking, and a desire to stay ahead of the curve. One of the interviewees went into greater detail, noting that BD and BDA are used to support optimal operations, commercial contracts, Medicare shared savings, risk optimization, cost and utilization, as well as quality measures. Another interviewee shared that the organization uses BD and BDA to explore better ways of bundling services so that the facility does not lose money and possibly makes a profit to compensate for communities and services that are harder to pay for. A third interviewee shared that BD and BDA are used for predictive analytics around readmissions or to address questions pertaining to the health of surrounding communities.
How Are BD and BDA Used by Organizations – Dynamics/Processes
Interviews revealed that the way BD/BDA are used varies from one organization to another. The care provider organizations that use EPIC had more in common. They capitalize on the templates and predictive models pushed by EPIC, given they run daily, and provide users with opportunities to act on the findings. Even when templates or models are not fully understood, there is trust in the vendor who provides the idea and tool. Often, such tools are integrated without a clear plan on how the information will be used, as in the case of a model that predicts the risk of a patient dying in the next year. Yet, three interviewees shared that some units have plans, or some have ideas about what they want but have no tool to develop it. Generally, the business side drives the types of analyses by telling IT what’s needed. IT explains what’s possible with the data and tools available. Results of BDA are used as a basis for operational and senior-level decisions, justification of investments, public health, care management, patient outreach, education, vendors, and for potential restructuring of the organization.
The interview with the individuals at the national quality organization showed a different process. Given that they are an organization that creates measures, ideas for quality measures are prioritized, and once decided, a technical expert panel defines the specifications for that measure. Then, the company uses the BD and BDA to apply specifications and test the measure for reliability and validity. For example, an opioid measure is tested, and then adjusted by removing certain populations, such as hospice or cancer patients. Measures are sometimes imposed by the Centers for Medicare and Medicaid Services (CMS), as well as driven by the National Quality Forum. Measures are often risk-adjusted for age, sickness, living location, race, ethnicity, and low-income status for Medicare. BD and BDA are also used to interpret clinical guidance with the data available. Lastly, they are used to maintain measurements; as clinical guidelines or literature review change, measures are re-tested.
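The idea of testing a measure and then adjusting it by removing certain populations can be sketched as follows. The patient rows, flag names, and the measure definition are hypothetical and chosen only to show how exclusions change the denominator and the resulting rate; they do not reproduce any actual measure specification.

```python
# Hypothetical patient-level rows; flags and values are illustrative only.
patients = [
    {"id": 1, "hospice": False, "cancer": False, "high_dose_opioid": True},
    {"id": 2, "hospice": True,  "cancer": False, "high_dose_opioid": True},
    {"id": 3, "hospice": False, "cancer": True,  "high_dose_opioid": True},
    {"id": 4, "hospice": False, "cancer": False, "high_dose_opioid": False},
    {"id": 5, "hospice": False, "cancer": False, "high_dose_opioid": False},
]

def measure_rate(rows, exclusions=()):
    """Share of eligible patients meeting the numerator condition.

    Patients with any of the exclusion flags set are removed from the
    denominator before the rate is computed.
    """
    denominator = [r for r in rows if not any(r[flag] for flag in exclusions)]
    if not denominator:
        return None  # no eligible patients, so the rate is undefined
    numerator = [r for r in denominator if r["high_dose_opioid"]]
    return len(numerator) / len(denominator)

unadjusted = measure_rate(patients)
adjusted = measure_rate(patients, exclusions=("hospice", "cancer"))

print(f"Unadjusted rate: {unadjusted:.2f}")
print(f"Rate after exclusions: {adjusted:.2f}")
```

In this toy data, excluding hospice and cancer patients lowers the rate from 3 of 5 to 1 of 3, which shows why exclusion criteria need to be fixed in the specification before a measure is compared across organizations.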
The other distinct organization, the data platform company, uses BD and BDA to assess how well the client company is using the data. They are able to trace and identify user errors (as per regulations pertaining to data hosting services), identify faulty software, and use BDA to decide on how to prevent similar errors in the future. Such insight helps build better technologies to manage an organization’s data and test software as needed. Additionally, the company uses BD to understand product features, identify whether the product is working as it should, and proactively check quality of operations in the cloud platform and SAS platform.
How Are BD and BDA Used to Support Population Health?
When asked about how the organizations use BD and BDA to support population health initiatives, responses pertaining to care provider organizations had three areas in common: claims analytics; risk optimization; and quality measures. Claims data are heavily analyzed to identify opportunities for reducing costs and clinical variation, comparing utilization indicators with peers, improving utilization and efficiency, as well as informing and supporting value-based contracts. One of the interviewees shared that geospatial analytics is also used to identify heat map areas in terms of cost-utilization for primary care facilities. Discussion on risk optimization was focused on better documentation of the level of risk, rather than BDA. Quality measures pertaining to the internal patient population are collected and reported. Additionally, there are efforts to understand the populations outside internal data sources. Depending on the request, the organization may include state or national level data that is publicly available. Two interviewees have community partnerships to address issues like health equity and social determinants of health. One organization uses the internal data available to make broad assumptions about the population (although access to the clinical data of that larger population may be limited or not available). Another organization is actively engaged with tribal leaders for outreach to minority communities and better population health management. The latter organization also performs spatial analysis and uses a geographic information system (GIS) and Microsoft platform, QlikView. Additionally, one interviewee shared progress in customizing a wellness program and diabetes predictive model for employees.
The interviewees from the national quality organization shared that they support population health tangentially by creating measures that drive incentives in the marketplace, which then drive health plans to manage population health and intervene as necessary. GIS or mapping algorithms are not used currently, but a machine learning algorithm would help identify the highest risk patients, or those most likely to be impacted.
The data platform company is mostly engaged in data collection exercises to understand people's behaviors and trends in relation to data. For example, spatial analysis is used to monitor air quality during California fires, and decisions can be made accordingly. There is potential to build use-case software that helps healthcare organizations not only monitor health data but also recognize patterns. Additionally, it was pointed out that there is potential to capitalize on data derived from sensors and IoT devices for better management of population health.
Challenges Pertaining to BD and BDA
When asked about the challenges observed in relation to BD and BDA, interviewees identified various aspects that are grouped into four categories: leadership; data literacy; system integration; and data characteristics. Challenges related to data characteristics are organized by volume, variety, velocity, veracity, value, and integrity.
Leadership-Related Challenges
All interviewees shared that organizational leadership is focusing on BD and BDA and that dedicated teams (large or small) are in place. However, aside from the data platform company, others have yet to establish clear strategies, alignment of strategy with BD and BDA, and pathways for optimal BD use and collaboration within the various units and with external parties. One interviewee said that there is a lack of ownership of all required data sources to perform desired analytics, as well as a lack of foundational infrastructure to support business needs. Another interviewee pointed out the leadership vacuum in certain areas. For example, in a university hospital, there are three important parties: clinicians; researchers; and administration. Clinicians are data generators, while researchers are data consumers. The administration follows the legal requirements: the Family Educational Rights and Privacy Act controls teaching data, the Health Insurance Portability and Accountability Act controls patient data, and Institutional Review Boards control research data. Tensions exist over data management, and trusted relationships need to be developed among the three parties.
Data Literacy Challenges
All interviewees noted that there is a need to improve data literacy across business operations. Graphs are misinterpreted, and decisions are often based on assumptions. There is a gap in translating business needs into what is possible to do with existing BD/BDA or how it could be made possible. One interviewee mentioned that BD and BDA "is not something you can learn in a book. It is understanding what the data is telling you."
The data platform company shared that most users don’t use proper search terms, cannot do data analysis, or build a dashboard by using the SAS platform because they do not know the language to engage the platform. However, they are working on training users, as well as making the platform easier to use.
System Integration-Related Challenges
Five interviewees brought up that information is siloed. Integrating hospital clinical data with billing data, or claims data, or data from various practices is a challenge. People working with data question practices around patient duplication across the system or even proper physician identification in the multiple databases, given lack of proper integration. There are questions on how to index the data. As per one interviewee, “System standards exist but EHRs are customizable. For example, heparin control could be recorded in four different EHR locations in different organizations depending on the system customization. In the absence of guardrails, interoperability means relatively little; theoretically possible but it's pragmatically difficult because of choice.” There is concern that vendor competition and the market system in the US add to the challenge of integration.
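The patient duplication concern raised above can be illustrated with a simplified record-linkage sketch across two siloed systems. The records, field names, and the matching rule (normalized name plus date of birth) are hypothetical; production systems typically use a master patient index with probabilistic matching rather than an exact key like this one.

```python
from collections import defaultdict

# Hypothetical records from two siloed systems (not real data).
clinical = [{"mrn": "C-17", "name": "Maria  Lopez", "dob": "1980-02-14"}]
billing = [
    {"account": "B-903", "name": "LOPEZ, MARIA", "dob": "1980-02-14"},
    {"account": "B-904", "name": "John Smith", "dob": "1975-07-01"},
]

def match_key(name, dob):
    # Normalize case, punctuation, spacing, and word order so that
    # "Maria  Lopez" and "LOPEZ, MARIA" produce the same key.
    tokens = sorted(name.replace(",", " ").lower().split())
    return (" ".join(tokens), dob)

# Index every record from every system under its normalized key.
index = defaultdict(list)
for system, rows in (("clinical", clinical), ("billing", billing)):
    for row in rows:
        index[match_key(row["name"], row["dob"])].append((system, row))

# Keys that appear more than once are candidate matches for the same person.
linked = {key: hits for key, hits in index.items() if len(hits) > 1}

for key, hits in linked.items():
    print(key, [system for system, _ in hits])
```

An exact key like this misses typos and name changes and can falsely merge different people who share a name and birth date, which is part of why integration without shared identifiers remains difficult in practice.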
Data Characteristics-Related Challenges
Five interviewees shared that there are challenges associated with handling large amounts of data. Not all organizations are provided with the equipment needed to analyze such volume. As per interviewee, “Using SAS in our computers or Optum landsite on a remote desktop can be limi