Risks with Blended Data
Managing and disclosing risks in blended datasets can be challenging, particularly given recent advances in large language models such as GPT. Deceptive dataset files can increase the availability of data that shifts information from informing to misinforming, and a growing body of work examines building machine learning datasets from blended sources. Blended data combines previously collected data streams with newly discovered dataset sources. Synthetic data, by contrast, is information that is artificially generated rather than produced by real-world events; generated algorithmically, it can be used to validate mathematical models and to train machine learning models. Depending on how it is produced, synthetic data can improve the quality of analyses or misrepresent them.
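As a minimal sketch of the idea above, synthetic records can be drawn from a simple model fitted to real observations, so the real values need not be released. The variable names and the income example are illustrative assumptions, not part of any agency's actual pipeline:

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" observations: household incomes.
real = [random.gauss(50_000, 8_000) for _ in range(1_000)]

# Fit a simple parametric model to the real data, then draw synthetic
# records from the fitted model instead of releasing the real values.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(1_000)]

# Usefulness check: the synthetic data should roughly preserve the
# statistics an analyst would compute on the real data.
mean_gap = abs(statistics.mean(synthetic) - mu)
```

If the fitted model is a poor description of the real data, the synthetic records will misrepresent downstream analyses, which is exactly the dual-use risk the paragraph describes.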
Blending dataset sources to create a new machine learning dataset raises concerns about protecting privacy, which may affect decisions about multi-dataset access. Decisions about data blending involve managing trade-offs between disclosure risks and data usefulness. The blended-dataset lifecycle should permit a level of data access commensurate with anticipated usefulness and acceptable risk. As a case in point, data blending often requires linking subjects across multiple dataset sources; effective linkage may require identification numbers, names, or other confidential fields to be shared across dataset owners or holders.
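One common mitigation for the linkage problem above is to link on keyed hashes of the confidential identifier rather than the identifier itself. The sketch below assumes a shared secret agreed between holders out of band; the identifiers and field names are hypothetical:

```python
import hashlib
import hmac

def pseudonymize(identifier: str, shared_key: bytes) -> str:
    """Keyed hash of a confidential identifier, so two holders can link
    records without ever exchanging the raw value."""
    normalized = identifier.strip().lower().encode()
    return hmac.new(shared_key, normalized, hashlib.sha256).hexdigest()

KEY = b"example-shared-secret"  # hypothetical; agreed out of band

# Each holder pseudonymizes its own identifiers locally.
holder_a = {pseudonymize("123-45-6789", KEY): {"income": 52_000}}
holder_b = {pseudonymize("123-45-6789", KEY): {"zip": "20233"}}

# Blend on the pseudonym, never on the raw identifier.
blended = {k: {**holder_a[k], **holder_b.get(k, {})} for k in holder_a}
```

The keyed hash (rather than a plain hash) matters: without the secret key, an outside party cannot rebuild the pseudonyms by hashing guessed identifiers.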
Characterizing the usefulness of blended data can assist decision making. In particular, informed consent is fundamental to any discussion of machine-learning blended data. Defining usefulness is especially challenging when blended datasets are intended for release as research datasets. Recent changes to the legal infrastructure for statistical products have created new opportunities for blended data, but also new gaps to fill on the way to a national, and ultimately global, data infrastructure. Promoting a common lexicon does not mean that privacy requirements should become inflexible or mechanized; rather, standardized language would facilitate the integration of privacy policy with technical approaches.
This framework begins with a simple but critical question for agencies considering a data-blending project: what do we want to accomplish with blended data, and why? Determining the auspice and purpose of the blended-data project matters because both can shape the project at every stage, from how datasets are assembled to how they are analyzed and shared. It is therefore worthwhile for agencies to begin a data-blending task by cataloging the dataset's ingredients: data fields that may come from federal agencies, state/local agencies, private-sector companies, or other parties. Once the ingredient files are identified, confidentiality policy requirements and the proposed disclosure-limitation methods need to be shared with, and explained to, stakeholders before blending is attempted.
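The cataloging step described above can be made concrete as a simple inventory record per ingredient file. Everything here is a sketch under assumed names; the sources, fields, and disclosure-limitation methods are placeholders for what an agency would actually record:

```python
from dataclasses import dataclass

@dataclass
class Ingredient:
    """One ingredient file in a data-blending project (illustrative)."""
    name: str
    source: str                 # e.g. "federal", "state/local", "private-sector"
    confidential_fields: list   # fields subject to confidentiality policy
    disclosure_method: str      # proposed disclosure-limitation method

ingredients = [
    Ingredient("tax_records", "federal", ["ssn", "income"], "suppression"),
    Ingredient("retail_panel", "private-sector", ["loyalty_id"], "aggregation"),
]

# What stakeholders must review before blending is attempted.
review_items = [
    (i.name, i.confidential_fields, i.disclosure_method) for i in ingredients
]
```

A structured inventory like this also gives stakeholders a single artifact to review, rather than requirements scattered across data-sharing agreements.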
Before a blended dataset product is disseminated, it is prudent for agencies to develop and execute a maintenance plan, ideally before machine learning or GPT-style AI systems consume the dataset. Changes may be made to ingredient data files, such as error corrections or the collection of additional information. "Zombie" dataset files, owned by users who have left the organization, may need their ingredients purged from the dataset. And what happens when a machine learning dataset is deprecated for legal, ethical, or technical reasons but continues to be widely used? Agencies can assess whether such changes affect the quality of downstream analyses enough to justify re-blending the data, which could entail additional disclosure risks. The nature of the end products can also affect this decision. Moreover, burdening data holders with anticipating and managing these situations may disincentivize data sharing. In practice, managing the risks of composition may require research on new technical approaches and techniques for protecting confidentiality while maintaining usefulness. As data pathology has shown, data delusion is distinct from a belief based on false or incomplete information, confabulation, illusion, hallucination, or other misleading effects of machine perception. Are humans evolving from reasoning to algorithms, or are algorithms replacing human reasoning? That question is worth contemplating.
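One mechanical piece of the maintenance plan above is detecting when an ingredient file has changed, so the re-blending decision can be triggered rather than discovered. A minimal sketch, assuming JSON-serializable ingredient records and hypothetical field names:

```python
import hashlib
import json

def fingerprint(records) -> str:
    """Stable checksum of an ingredient file's contents."""
    payload = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Baseline recorded when the blended product was first assembled.
v1 = [{"id": 1, "income": 52_000}]
baseline = fingerprint(v1)

# Later, an error correction lands in the ingredient file.
v2 = [{"id": 1, "income": 53_000}]

# A changed fingerprint flags the file for a re-blending review; it does
# not by itself decide whether re-blending is worth the disclosure risk.
needs_reblend_review = fingerprint(v2) != baseline
```

The checksum only answers "did it change?"; the harder judgment in the text, whether the change affects downstream analyses enough to justify re-blending, remains a human decision.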