We Finished the Second Plane. Here’s What the Full Stack Actually Looks Like.
A technical update on the ISA-95 bottling line demo — three planes running, data governed end to end, and a downtime model that mirrors how real plants actually work.
In the first update, I showed a historian full of telemetry and an AI answering questions about microstops from live signal data. It was a proof of concept for the telemetry plane — “what’s happening, second by second.”
This update is about what we built next.
The transactional plane. The governed layer. The part of the architecture that transforms signals into facts.
Where We Are Now
The demo system now runs three planes end to end on two Raspberry Pis costing less than €200 combined.
Plane 1 — Machine Layer
A Python simulator running on a Pi 4, publishing directly to MQTT. It executes a weekly production schedule loaded from an Excel workbook via a FastAPI ERP shim. The schedule drives everything: order sequences, changeover types, CIP cycles, and three injected major breakdowns.
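As a rough sketch of what that ERP shim might look like (the endpoint path, workbook filename, sheet name, and column layout here are all assumptions for illustration, not the repo's actual code):

```python
# Hypothetical sketch of the ERP shim: serve the weekly schedule from Excel.
# Endpoint path, filename, sheet name, and columns are assumptions.
from fastapi import FastAPI
from openpyxl import load_workbook

app = FastAPI()

@app.get("/schedule/week")
def weekly_schedule():
    wb = load_workbook("weekly_schedule.xlsx", read_only=True)
    ws = wb["Schedule"]
    return [
        {"order_id": r[0], "sku": r[1], "planned_qty": r[2], "start": str(r[3])}
        for r in ws.iter_rows(min_row=2, values_only=True)  # skip header row
        if r[0] is not None
    ]
```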
The simulator runs as a systemd service — starts on boot, restarts on failure, logs to journald. It’s not a script you run in a terminal. It’s a plant that runs continuously.
Ten microstop types fire probabilistically during production, each with a distinct signal fingerprint. Fill Stabilisation Wait (MS02) is weighted more heavily on large-volume SKUs. Cap Feed Stutter (MS04) fires independently of fill state. Each one maps to a defined pattern of signals in the historian — scale_stable = false AND fill_time_ms > expected, for example — that the AI can cluster and identify without any operator input.
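A minimal sketch of how such a fingerprint check could work. The scale_stable and fill_time_ms signals come from the example above; the per-SKU baselines, the cap_feed_jammed signal, and the function itself are illustrative assumptions:

```python
# Sketch: classify a short stop window against microstop fingerprints.
# scale_stable and fill_time_ms are from the example above; the baselines
# and the cap_feed_jammed signal are hypothetical.
EXPECTED_FILL_MS = {"2L": 2400, "500ml": 900}  # assumed per-SKU baselines

def classify_microstop(window: dict) -> str | None:
    """window holds the historian signals sampled during a short stop."""
    if (not window["scale_stable"]
            and window["fill_time_ms"] > EXPECTED_FILL_MS[window["sku"]]):
        return "MS02"  # Fill Stabilisation Wait
    if window["cap_feed_jammed"]:
        return "MS04"  # Cap Feed Stutter, independent of fill state
    return None  # no fingerprint matched; leave unclassified
```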
Plane 2 — Telemetry
High-rate signals published to MQTT and stored in a TimeBase historian on the Pi 5. 64 tags, ISA-95 hierarchy baked into every tag name:
historian/Amarach/Crosshaven/Bottling/Line01/Filler01/actual_weight_g
The historian stores what’s happening. Continuous. Store-on-change. No schema, no governance — just the signal stream. This is where the microstop fingerprints live, and where the AI reads them.
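For illustration, store-on-change publishing against that tag path is a few lines of Python, assuming a local broker and a simple last-value cache:

```python
# Sketch: store-on-change publishing to the ISA-95 tag path shown above.
# Broker address and the change-detection cache are assumptions.
import paho.mqtt.client as mqtt

TAG = "historian/Amarach/Crosshaven/Bottling/Line01/Filler01/actual_weight_g"
client = mqtt.Client()  # paho-mqtt 1.x style; 2.x needs a CallbackAPIVersion
client.connect("localhost", 1883)

_last = {}

def publish_on_change(tag: str, value: float) -> None:
    """Only publish when the value actually changes (store-on-change)."""
    if _last.get(tag) != value:
        _last[tag] = value
        client.publish(tag, value)
```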
Plane 3 — Transactions
This is what’s new.
Every governed event — OrderStarted, StateChanged, FaultRaised, MicrostopEnded, BottleCompleted — flows from the simulator over MQTT onto a separate topic plane. Node-RED subscribes, validates, classifies, and writes to PostgreSQL.
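For a feel of what travels on that topic plane, a transaction event might look like this on the wire. Only the eventType values come from this article; the topic shape and every other field name are assumptions:

```python
# Illustrative transaction event as the simulator might publish it.
# Field names other than eventType are assumptions, not the real payload.
import json
import time

event = {
    "eventType": "MicrostopEnded",
    "site": "Crosshaven",
    "line": "Line01",
    "orderId": "ORD-0042",   # hypothetical
    "reasonCode": "MS04",    # Cap Feed Stutter
    "durationMs": 8350,
    "endedTsUtc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
payload = json.dumps(event)
# client.publish("transactions/Amarach/Crosshaven/Bottling/Line01/MicrostopEnded", payload)
```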
The separation matters. Historians are not databases. You cannot run OEE queries against a time-series store and expect the right answer. The historian tells you what the line was doing. The transactional store tells you what happened, why it happened, and how long it lasted.
The Node-RED Architecture
Four flows, each with a distinct responsibility:
Schema Init runs once on startup. It executes the full PostgreSQL DDL — all tables, all indexes, all seed data — using IF NOT EXISTS everywhere. The system is self-initialising. You don’t need a DBA or a manual setup step. Deploy Node-RED, point it at Postgres, restart it, and the schema appears.
Transaction Listener subscribes to transactions/+/+/+/+/# and routes by eventType. Each event type has its own handler function that builds the appropriate SQL and writes to the correct table. Orders land in orders. State transitions land in state_events. Faults, changeovers, CIP, and stops all land in downtime_events with the right classification. Bottle events are sampled 1-in-10 into bottle_events.
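Sketched in Python for readability (in the demo this logic lives in Node-RED function nodes), the routing and 1-in-10 sampling look roughly like this; the handler and helper names are illustrative:

```python
# Python sketch of the Transaction Listener routing; the real version is
# Node-RED function nodes. Names and sampling detail are illustrative.
def insert(table: str, evt: dict) -> None:
    """Stand-in for the parameterised SQL write each handler builds."""
    print(f"INSERT INTO {table}: {evt['eventType']}")

_bottle_seq = 0

def handle_bottle(evt: dict) -> None:
    """Sample every tenth BottleCompleted into bottle_events."""
    global _bottle_seq
    _bottle_seq += 1
    if _bottle_seq % 10 == 0:
        insert("bottle_events", evt)

HANDLERS = {
    "OrderStarted":    lambda e: insert("orders", e),
    "StateChanged":    lambda e: insert("state_events", e),
    "FaultRaised":     lambda e: insert("downtime_events", e),
    "MicrostopEnded":  lambda e: insert("downtime_events", e),
    "BottleCompleted": handle_bottle,
}

def route(evt: dict) -> None:
    handler = HANDLERS.get(evt["eventType"])
    if handler is not None:
        handler(evt)
```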
OEE Aggregator subscribes to the telemetry line_state signal and accumulates time buckets on the current open order. Every state change increments either planned_downtime_ms (for CIP and changeovers) or unplanned_downtime_ms (for everything else). It writes these buckets back to the orders table in real time. OEE% is never stored — it’s always calculated on query from the components.
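The accumulation logic, sketched under assumptions: the state names, the PLANNED_STATES set, and the add_to_order helper are illustrative, while the two bucket columns are the real ones from the orders table:

```python
# Sketch of the time-bucket accumulation on the current open order.
import time

PLANNED_STATES = {"CIP", "CHANGEOVER"}  # everything else counts as unplanned

class OeeAggregator:
    def __init__(self) -> None:
        self.state = "RUNNING"
        self.since = time.monotonic()

    def on_state_change(self, new_state: str, order_id: str) -> None:
        elapsed_ms = int((time.monotonic() - self.since) * 1000)
        if self.state != "RUNNING":  # running time is not downtime
            bucket = ("planned_downtime_ms" if self.state in PLANNED_STATES
                      else "unplanned_downtime_ms")
            add_to_order(order_id, bucket, elapsed_ms)
        self.state, self.since = new_state, time.monotonic()

def add_to_order(order_id: str, column: str, ms: int) -> None:
    """Stand-in for: UPDATE orders SET <column> = <column> + ms."""
    print(f"orders[{order_id}].{column} += {ms}")
```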
Downtime Classifier runs every 60 seconds. It queries for any downtime events where is_assigned = false AND classification = 'OPERATOR' AND ended_ts_utc IS NOT NULL, packages them as an UnassignedDowntime payload, and publishes to MQTT for tablet pickup. This is how the accountability gap metric works — unassigned downtime is tracked as a live number, not a gap you discover at end of shift.
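A sketch of that sweep, assuming a psycopg-style connection. The is_assigned, classification, and ended_ts_utc columns are quoted above; the other column names, the SQL shape, and the MQTT topic are assumptions:

```python
# Sketch of the 60-second classifier sweep. Selected columns beyond the
# filter predicates, and the topic, are illustrative assumptions.
import json

UNASSIGNED_SQL = """
    SELECT event_id, started_ts_utc, ended_ts_utc, duration_ms
    FROM downtime_events
    WHERE is_assigned = false
      AND classification = 'OPERATOR'
      AND ended_ts_utc IS NOT NULL
"""

def sweep(conn, client) -> None:
    rows = conn.execute(UNASSIGNED_SQL).fetchall()
    if rows:
        payload = {"eventType": "UnassignedDowntime",
                   "events": [list(map(str, r)) for r in rows]}
        client.publish("transactions/ui/unassigned_downtime", json.dumps(payload))
```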
The PostgreSQL Schema
Six tables, all following the same naming discipline: snake_case, _ts_utc suffixes on timestamps, _ms suffixes on durations, is_ prefix on booleans.
shifts holds one row per shift slot (AM/PM), keyed on shift code and date.
orders holds the full order lifecycle — planned quantity, good count, reject count, all OEE time buckets, and a generated total_count column.
downtime_events is the unified stop log: microstops, auto-classified events, and operator-assigned stops all live here, with a classification field (MICROSTOP, AUTO, OPERATOR) and a source field (derived, system, operator).
downtime_reasons is the reference table, seed-populated on startup.
state_events stores every StateChanged transaction.
bottle_events stores sampled quality records with weight and torque.
The most important design decision: total_count is a generated column (GENERATED ALWAYS AS (good_count + reject_count) STORED). You never update it directly. You never recalculate it. It’s always correct.
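In PostgreSQL syntax, that column might be declared like this inside the Schema Init DDL. The surrounding column definitions are trimmed and partly assumed; only the counts, the downtime buckets, and total_count come from the article:

```python
# The generated-column DDL, held as a string the Schema Init flow could
# execute. Column names beyond those named in the article are assumptions.
ORDERS_DDL = """
CREATE TABLE IF NOT EXISTS orders (
    order_id              text PRIMARY KEY,
    good_count            integer NOT NULL DEFAULT 0,
    reject_count          integer NOT NULL DEFAULT 0,
    planned_downtime_ms   bigint  NOT NULL DEFAULT 0,
    unplanned_downtime_ms bigint  NOT NULL DEFAULT 0,
    total_count           integer GENERATED ALWAYS AS (good_count + reject_count) STORED
);
"""
```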
Three Types of Downtime
This is the model that makes the demo work.
Type 1 — Operator Assigned
When an unclassified stop occurs — the line goes to STOPPED with no known cause — a downtime_events row is created with is_assigned = false. The operator selects a reason from a tree on the tablet. Until they do, that duration appears as an “accountability gap.” The percentage of unassigned downtime is a real metric on the dashboard.
Type 2 — Auto Classified
When the system already knows the cause from the stop code — CIP, a planned changeover, a major breakdown with a fault code — the downtime_events row is inserted with is_assigned = true immediately. No operator input needed. The system classifies it from the event payload.
Type 3 — Microstops
Sub-120-second interruptions that never trigger a formal stop record. Derived from signal fingerprints in the historian. Never visible in the operator UI. The operator’s availability figure says 87%. The AI queries the historian and finds that 22% of the hidden loss is MS02 Fill Stabilisation Wait events on the 2L SKU. That’s the power moment.
Three types in one system. The operator sees Type 1 and Type 2. The AI sees all three.
OEE from First Principles
The formula is:
Availability = (Planned Production Time - Unplanned Downtime) / Planned Production Time
Performance = (Ideal Cycle Time × Good Count) / (Planned Production Time - Unplanned Downtime)
Quality = Good Count / Total Count
OEE = Availability × Performance × Quality
Everything in that formula is a stored component. None of the percentages are stored. When the definition of “planned” changes — or when you want to include or exclude microstops from the availability figure — you change the query, not the data.
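A sketch of what that query-side calculation could look like. The planned_production_ms and ideal_cycle_ms column names are assumptions; the arithmetic follows the formula above:

```python
# OEE computed at query time from stored components, never stored itself.
# planned_production_ms and ideal_cycle_ms are assumed column names.
OEE_SQL = """
SELECT
    (planned_production_ms - unplanned_downtime_ms)::numeric
        / planned_production_ms                           AS availability,
    (ideal_cycle_ms * good_count)::numeric
        / (planned_production_ms - unplanned_downtime_ms) AS performance,
    good_count::numeric / NULLIF(total_count, 0)          AS quality
FROM orders
WHERE order_id = %s;
"""
# OEE itself is availability * performance * quality, multiplied in the
# client or an outer SELECT.
```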
This is not a subtle distinction. It’s the difference between a system you can trust and a number on a dashboard you can’t explain.
The Four Dashboards
Grafana (v12 on the Pi 4) has four dashboards, provisioned from file so they survive container restarts:
Order Summary queries PostgreSQL. Active orders with progress against planned quantity. Completed orders with yield, duration, and actual vs planned time.
Downtime Pareto queries PostgreSQL. All downtime events ranked by total duration, including microstops. Joined to downtime_reasons for human-readable names. The Pareto includes everything — planned, unplanned, and derived. The accountability gap is a separate stat.
Fill Quality overlays historian data (fill weight deviation per bottle) with order context from PostgreSQL. This is the panel that demonstrates the historian-database join — signal data enriched by governed context.
Line Live Status queries the TimeBase historian via the Infinity datasource. Line state, speed, fill weight, torque. The historian’s ISA-95 tag structure means the Infinity queries are just URL-encoded tag names — no custom data transformation needed. This is the one I’m still struggling with: it works, then fails, so more digging is needed to get it stable.
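The tag-name encoding itself is one line of Python. The endpoint, port, and query parameters below are guesses at the historian's REST shape for illustration, not its documented API:

```python
# Sketch: build an Infinity-style URL from an ISA-95 tag name.
# Host, port, path, and parameters are hypothetical.
from urllib.parse import quote

tag = "historian/Amarach/Crosshaven/Bottling/Line01/Filler01/actual_weight_g"
url = f"http://pi5.local:4511/api/history?tag={quote(tag, safe='')}&last=5m"
print(url)
```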
What We Learned Building This
A few things are worth documenting.
Spending time defining the transaction payloads was worth it. Using B2MML as the base meant we had the data we needed when we went to build the application logic and the dashboards.
Transactions and telemetry are genuinely different problems. We tested routing everything through the historian. It was wrong — not inefficient, wrong. A historian has no concept of a validated fact. It stores what it receives. A transaction store stores what happened, after validation, with full context. Mixing them destroys the semantics of both.
Systemd is underrated for demos. The simulator and ERP shim run as services that start on boot, restart on failure, and write structured logs. The Pi is always running. No one has to SSH in and start a script before a demo.
The ISA-95 hierarchy is load-bearing. Every component — the MQTT topic structure, the historian tag names, the PostgreSQL schema design, the Node-RED routing logic — depends on the ISA-95 physical hierarchy being consistent and correct. It’s not documentation. It’s the primary key of the whole system.
What’s Next
The core infrastructure is in place. The remaining work falls into three areas.
A handful of bugs need fixing: Grafana’s Infinity query format for the historian, a fault code matching issue in the Node-RED downtime handlers, and SQL NULL handling in the OEE calculations.
The full Excel weekly schedule needs to run continuously at speed to generate a meaningful week of production data — orders, breakdowns, changeovers, microstop distributions — that tells a real story in the dashboards.
And then there are the MCP demo queries, the five questions every plant manager asks.
Each of those should return an answer in under five seconds from structured data. No report writer. No analyst. No custom query per question.
That’s the point of building the foundation correctly.
Everything is open source: 👉 https://github.com/SparkyLarks/ISA95-bottling-simulator
Hardware: Raspberry Pi 4 + Pi 5 — under €200 total.
Stack: Python, FastAPI, MQTT (MonsterMQ), Node-RED (FlowFuse), PostgreSQL, TimeBase Historian, Grafana 12.
Architecture: ISA-95 physical hierarchy, ISA-88 state machine, never store rolled-up OEE.