Hacking a nonlinear process for linear control: Part 2 of 3
Mid 19th century photo edited by Tommy Scharmann


This is part two of a three-part series:

  1. Introduction and identification of a nonlinear process
  2. Data collection, consolidation and cleaning
  3. Quantifying the nonlinearity and constructing the linearization algorithm from nonlinear data

----
Part 2:
Data collection, consolidation and cleaning
----

The verb "mining" has been applied to data for a number of years now and for good reason. The '49er's with their sluice boxes and pans were searching for value. In modern industrial operations there is much data. But, like the gravel, sand and rocks that plagued the gold miners of the Old West, most data has very little apparent value. Some data, by itself, does not provide actionable information about a process. When this data is correlated with another data stream you may well strike gold. Gold, for an automation professional, is data that nucleates actional information and ultimately leads to an algorithm realizing process control improvement.

Recognizing that a process is nonlinear (the subject of Part 1 of this series) is the first piece of actionable information obtained by comparing the controlling element data stream against the process variable data stream. Now we take the next step in preparation for creating our "linearizing algorithm."

Part 2 consists of three primary steps applied to data:

  1. Collect
  2. Consolidate
  3. Clean

--Collect the data--

For the purposes of this post, an acidic process stream controlled by base injection is modeled. Since the subject is nonlinear control, not pH control, the details of pH can be left for another day. Charts B and C in Part 1 of this series, and the data to follow, were created by modeling the titration of 1000 ml of 1 molar hydrochloric acid with 1 molar sodium hydroxide. The reaction is:

Reaction A: HCl + NaOH → NaCl + H₂O

When raising the pH of an acidic solution with a basic reagent, the titration curve takes an identifiable shape with which you are already familiar from Part 1.

Chart D

Each dot forming the trend is one of 890 data points generated in the spreadsheet model. Even with this many points, the change in value near the 1000 ml mark is so sudden that the individual data points become discernible. As shown in Part 1, the process gain is greater than 100 on this portion of the curve.

If we "normalize" the above chart in terms of percent you get Chart E.  Notice Chart E's axis labels. The normalized chart can also serve as a simulation of the data that would be generated if the PID is placed in manual control and the controlling element (maybe a metering pump injecting sodium hydroxide into an acidic stream) is commanded from 0 to 100% in small steps.

Chart E
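For readers who want to reproduce the modeled data outside of a spreadsheet, here is a minimal Python sketch of the underlying strong-acid/strong-base titration calculation. It assumes ideal behavior (complete dissociation, no activity corrections), a pH span of 0 to 14 mapped to 0 to 100% PV, and a full-scale reagent volume of 2000 ml mapped to 0 to 100% CE; those normalization choices are assumptions made here for illustration, not values taken from the original spreadsheet model.

```python
import math

def titration_ph(v_naoh_ml, v_acid_ml=1000.0, c_acid=1.0, c_base=1.0):
    """pH of a strong acid titrated with a strong base (ideal behavior assumed)."""
    mol_h = v_acid_ml / 1000.0 * c_acid           # initial moles of H+
    mol_oh = v_naoh_ml / 1000.0 * c_base          # moles of OH- added so far
    v_total_l = (v_acid_ml + v_naoh_ml) / 1000.0  # total volume in liters
    if mol_h > mol_oh:                            # before the equivalence point
        return -math.log10((mol_h - mol_oh) / v_total_l)
    if mol_oh > mol_h:                            # past the equivalence point
        return 14.0 + math.log10((mol_oh - mol_h) / v_total_l)
    return 7.0                                    # exactly at equivalence

# Normalized (percent) data similar in spirit to Charts D and E.
V_MAX_ML = 2000.0                                 # assumed full-scale reagent volume
points = [(v / V_MAX_ML * 100.0,                  # controlling element, %
           titration_ph(v) / 14.0 * 100.0)        # process variable, % of pH span
          for v in range(0, int(V_MAX_ML) + 1, 2)]
```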

To highlight the differences between the two simulated data sets in the charts above:

  1. The data in Chart D is obtained via bench titration.
    A sample of the untreated process stream is collected and then, in a lab, titrated with the same reagent used for treatment in the process, in this case sodium hydroxide. As sodium hydroxide is added, a benchtop pH probe and meter monitor the pH, and the lab technician records the data points.
  2. The data in Chart E is obtained in situ.
    The controlling element is placed at zero percent (NaOH metering pump off or injection valve closed) and then sped up or opened in steps. The process is allowed to reach steady state after each step change, and the steady-state values of the process variable and control element "position" are recorded. (A sketch of such a step-test logging loop follows this list.)
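If you do get a window to run the in-situ sequence, the logging itself is simple. The sketch below is only an outline of the loop described in item 2; `write_ce_percent`, `read_pv`, and the fixed settling delay are hypothetical stand-ins for whatever your control system, transmitter, and steady-state criterion actually are.

```python
import time

def write_ce_percent(percent):   # hypothetical: command the metering pump or valve
    ...

def read_pv():                   # hypothetical: read the pH transmitter
    ...

def step_test(step_pct=5.0, settle_s=600):
    """Step the controlling element from 0 to 100% and log steady-state (CE, PV) pairs."""
    log = []
    ce = 0.0
    while ce <= 100.0:
        write_ce_percent(ce)
        time.sleep(settle_s)     # crude stand-in for a real steady-state check
        log.append((ce, read_pv()))
        ce += step_pct
    return log
```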

Based on personal experience, option #1 is the better choice: it is unlikely that permission will be granted to take full control of a process and drive the process variable to such extremes. Normally, in a production operation, this is administratively impossible. Even if you do get a window to perform such a data collection sequence, the incoming flow or pH may change and confound your results. Experience has taught me that #1 produces the most reliable data for pH curve generation.

Since the data is modeled in a spreadsheet, many data points can be created. As mentioned earlier, the examples above have 890 generated points. This, of course, is not realistic for real-world data collection on a bench top.

A more likely representation of the amount of data collected in one iteration is shown in Chart F:

Chart F 

Chart F has 28 data points. In the model, the data points for Chart F were selected by assigning each point a pseudorandom integer between 0 and 100; a point was kept only if its number fell within a narrow range (for this point count, a range of 2, specifically 49 to 51). A seemingly random selection like this is not far from what real-world data collection looks like, despite a technician's best attempts to take readings at some regular interval. The reasons are interruptions (phone calls, etc.), accidental overshoot when titrating, pH probe problems, and so on. But the interval between recorded data points is not that important. What matters is the accuracy and precision of the data and of its curve fit. To that end, it is advisable to take multiple sets of data from multiple samples at varying times of day across a few days.
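For completeness, this is roughly how that point selection works in the model: each modeled row draws a pseudorandom integer from 0 to 100 and is kept only when the draw lands between 49 and 51, which retains about 3% of the rows, a few dozen points per run. The sketch reuses the `points` list from the titration sketch above.

```python
import random

def thin_randomly(points, low=49, high=51):
    """Keep a point only when its pseudorandom draw (0-100) falls within [low, high]."""
    return [p for p in points if low <= random.randint(0, 100) <= high]

bench_like_sample = thin_randomly(points)   # typically a few dozen (CE %, PV %) pairs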

Still, Chart F is too good to be true. The data points are all so nice and orderly. Above we talked about interruptions; interruptions cause noise in time. What about noise in the variable itself?

Chart G

To highlight the potential for noisy data, Chart G shows the data from Chart F as the larger dark green dots, and randomly selected data from the model, with modeled instrument noise added, as the smaller off-yellow dots. Considering the potential variability of data point collection, it is prudent to take multiple sets of data.
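The "variable noise" in Chart G can be mimicked by perturbing the modeled PV with a small random error. The sketch below assumes Gaussian instrument noise with a standard deviation of about 2% of span; the real magnitude depends on your probe and transmitter, so treat the number as a placeholder.

```python
import random

def add_instrument_noise(points, pv_sigma_pct=2.0):
    """Return a copy of (CE %, PV %) pairs with Gaussian noise added to the PV."""
    return [(ce, min(100.0, max(0.0, pv + random.gauss(0.0, pv_sigma_pct))))
            for ce, pv in points]

noisy_sample = add_instrument_noise(thin_randomly(points))
```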

--Consolidate the data--

One thing I have found to be very helpful when dealing with data is visualization. The power of visualization cannot be overstated. To that end, the first step is color. I have found that coloring data helps me see relationships in ways I never expected. "Consolidate the data" consists of the following steps:

  1. Visualize
    Using the spreadsheet of your choice, color the cells of the table for each data set. Create a single scatter plot containing all of the data sets, matching the color of the points to the data set cell colors.
  2. Combine
    Create a copy of your data sheet in the spreadsheet workbook so your original data remains unaltered. Combine the data on the new sheet, keeping it in the same workbook.

--Consolidate the data--Step 1 Visualize--

Simulating five sets of data produces five tables. This data represents our version of the rocks, gravel, and sand the '49ers had to sluice. Looking at the data in separate tables does not afford any intuitive insights. Regardless, color the cells of each data point in the tables.


Table A

Table B

Table C

Table D

Table E

Plot each table as a scatter plot, remembering to match the point colors to the cell colors.

Chart H

The plot allows us to quickly recognize that each data set produces a similar curve.
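If your data sets live in CSV files rather than colored spreadsheet cells, the same color-matched scatter plot takes only a few lines of matplotlib. The file names and colors below are placeholders; the sketch assumes header-less two-column files of (CE %, PV %) pairs.

```python
import csv
import matplotlib.pyplot as plt

files = ["set_a.csv", "set_b.csv", "set_c.csv", "set_d.csv", "set_e.csv"]   # placeholders
colors = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]

for fname, color in zip(files, colors):
    with open(fname, newline="") as f:
        rows = [(float(ce), float(pv)) for ce, pv in csv.reader(f)]
    plt.scatter([ce for ce, _ in rows], [pv for _, pv in rows],
                s=12, color=color, label=fname)

plt.xlabel("Controlling element (%)")
plt.ylabel("Process variable (%)")
plt.legend()
plt.show()
```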

--Consolidate the data--Step 2 Combine--

Table F below is a single table of all the data combined and sorted by controlling element value in ascending order. Below the table is a screen recording of manually combining the data in Google Sheets. The process is not much different in the various spreadsheet programs available. There are automated ways to perform such tasks using pivot tables and/or scripting. For the sake of demonstration, I performed the combination manually.

Table F

Video A
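The same combine-and-sort step can also be automated outside the spreadsheet. Here is a minimal pandas sketch, again assuming header-less two-column CSV files; the column names ce_pct and pv_pct are chosen here purely for illustration.

```python
import pandas as pd

files = ["set_a.csv", "set_b.csv", "set_c.csv", "set_d.csv", "set_e.csv"]   # placeholders

# Stack the five data sets and sort by controlling element, ascending (Table F's layout).
combined = (pd.concat([pd.read_csv(f, header=None, names=["ce_pct", "pv_pct"])
                       for f in files], ignore_index=True)
              .sort_values("ce_pct")
              .reset_index(drop=True))
combined.to_csv("combined.csv", index=False)
```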

--Clean the data--

(A step by step screen recording demonstration of setting up the spreadsheet for data cleaning is at the end of this post.)

Now that we have the data consolidated, we need to clean it.

  1. Sort the data by controlling element value in ascending order.
    Sorting data is trivial in a spreadsheet, so this step was already accomplished during consolidation.
  2. Select the bin size (the range of each bin) based on the controlling element.
  3. Average the controlling element point values within each controlling-element bin.
  4. Average the process variable point values within each controlling-element bin.
The above procedure is similar to creating a histogram in statistics, except that here we average two variables within each bin instead of counting occurrences.
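For readers working outside the spreadsheet, here is a minimal pandas sketch of the same bin-and-average idea, reusing the `combined` data frame from the consolidation sketch above. It is one possible implementation, not the spreadsheet's formulas.

```python
import pandas as pd

def clean_by_binning(combined: pd.DataFrame, bin_size: float = 10.0) -> pd.DataFrame:
    """Average CE and PV within fixed-width bins of the controlling element."""
    # clip keeps a CE reading of exactly 100% in the top bin rather than a bin of its own
    bins = (combined["ce_pct"].clip(upper=99.999) // bin_size).astype(int)
    return (combined.groupby(bins)[["ce_pct", "pv_pct"]]
                    .mean()
                    .reset_index(drop=True))

cleaned_10 = clean_by_binning(combined, bin_size=10)   # about 10 points, as in Chart I
cleaned_5 = clean_by_binning(combined, bin_size=5)     # empty bins simply drop out
```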

A bin size of 10 will result in 10 data points, provided at least one data point was collected in every 10 percent span of the controlling element's range. Referring to Table F, this is indeed the case. Chart I below, with the inset table, is the result of a bin size of 10.

Chart I

Since we are modeling the acid-base reaction, we "know" what the ideal values should be (refer to Chart E). Our calculated process variables, based on the consolidated data in Table F, are not exact duplicates of the ideal PV values. Of course, they never will be in the real world. Besides, we will never know anything other than what we measure.

A bin size of 5 results in Chart J (remembering green is ideal and red is collected):

Chart J

Notice that only 18 data points are produced, not 20. This is a consequence of missing values in our collected data over the CE ranges of 10% to 15% and 35% to 40%. Verify this in Table F.

As expected, a bin size of 20 yields 5 data points:

Chart K

The charts above were produced in a spreadsheet without the use of code, only formulas. In the spirit of rapid scenario production, a spreadsheet, used properly, allows for various "what-ifs" without tedious and time-consuming manual calculation. If you are familiar with the scripting language associated with your spreadsheet of choice, you can take the methods demonstrated here to a higher level of automation and speed.

As promised, below is a video of one way to create such a spreadsheet tool. This is how the above tables and charts were produced. The Google sheet will be made available for download once Part 3 of this series is posted.

Video B

Simple averaging within each bin is only one, admittedly basic, method of "data cleaning." Using modeled data with random point selection and the addition of random noise and/or data biasing allows you to test other methods and find what best fits your modeled data set. This can inform your decision on the best method for cleaning a real-world data set. One such comparison is sketched below.
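As one example of an "other method," the per-bin mean can be swapped for a per-bin median, which is less sensitive to an occasional wild reading. This sketch builds on the objects from the earlier sketches and illustrates the comparison idea, not a recommendation.

```python
# Compare mean- vs. median-per-bin cleaning on the same consolidated data.
bins = (combined["ce_pct"].clip(upper=99.999) // 10).astype(int)
by_mean = combined.groupby(bins)[["ce_pct", "pv_pct"]].mean()
by_median = combined.groupby(bins)[["ce_pct", "pv_pct"]].median()
# With the modeled "ideal" PV curve in hand, the residuals of each candidate method
# can be compared to decide which cleaning approach to trust on real-world data.
```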

Stay tuned, Part 3: Quantifying the nonlinearity and constructing the linearization algorithm from nonlinear data will be posted soon.


About the author:
A puzzle junky who thoroughly enjoys solving problems, especially when the solution involves an implementation of technology.

If you spot any typos, technical errors, missing links, graphics, etc. please let me know. I have no editorial staff, except you.

Do you have any questions? I am glad to help.
