Is an entirely new data modeling notation needed?

There have been a couple of discussions on unrelated threads regarding COMN (What is the relevance for data modelling in big data space? and Hey all.. I just want to know what are different approaches you follow for designing a conceptual , logical and physical model?). I am concerned that those threads are getting hijacked for vigorous discussions on COMN. Therefore, I have copied the relevant posts here and am posting my responses to those posts here, so that those threads can return to their original topics.

Reaan Botha said: Theodore Hills, Chen's original notation represents the concepts of ER modeling directly. Software packages like those you mention tend to conflate concepts and abuse ER terminology; they usually confuse entity sets with tables and relationships with foreign key constraints. Try drawing a ternary relationship in one of those applications. To understand ER, I would recommend studying Chen's original papers, which describe how entities and relationships map to tables.

I share your interest in the confusion about types, classes, objects, states and values, but I'm not sure the current discussion is the place to unpack it. I'll just say that in my opinion, in information systems, classes and objects are best used for constructing system components, state machines, stateful services, and ADTs, not for modeling data.

Theodore Hills says: Reaan, I have studied Chen’s original papers, and I agree that the mainstream practice of E-R logical data modeling does not in fact follow Chen’s original ideas. I am most concerned with the confusion of relationships with foreign key constraints. In fact, analysis of E-R notation and its standard interpretation shows that an E-R relationship which maps to a foreign key constraint is not a relationship at all, but rather a subtype definition. See my book, NoSQL and SQL Data Modeling, Chapter 15, Relationships and Roles.

I also agree that classes and objects are best for constructing system components. However, I think we need a way to express classes, objects, entity types, data types, and data structures in compatible terminology and symbology, because database management systems, and the applications that use them, are built using classes and objects. Without unified modeling and terminology, we will continue with the current situation where programmers and data modelers don’t understand each other and talk past each other, and where the mapping from data to objects is primarily in the minds of implementers, not something that can be seen, studied, and engineered.

Clifford Heath said: Theodore "E-R logical metamodel has always lacked the concept of array types and composite types." Though it does not include them, neither does it exclude them. These are simply things that exist outside of the logic; a logical model can say everything it needs to without including them. Now if you want to use your "logical model" as a place to dump things that aren't part of the logic, go ahead; but it's unfair to criticise the diagramming syntax for not accommodating you.

My point is that both the ideas of an ordered list of things of similar type (an array) and of a finite set of associated things of various individual types (a composite) are things that are handled perfectly well by an E-R model. When you start wanting to say "store these things together in a particular kind of arrangement" then you've left the logical model behind and are firmly in the world of the physical. Computer programs need those things, but logical models do not.

Theodore Hills says: Clifford, as I understand E-R notation, there is no way to express that a single attribute of an entity type has an array type, nor is there any way to express that a single attribute of an entity type has a composite type. In E-R notation, an array or a composite type must be expressed by a separate entity type which is referenced by a key. The concise expression of array types and composite types is not for the sake of expressing physical implementations, but rather for the sake of concise expression of a logical model. Lacking such types, one needs to introduce unnecessary entities to a model. Occam’s Razor tells us that we should avoid such things, and efficiency of expression is a main goal of a logical model.
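The contrast described above can be made concrete with a minimal Python sketch. All names here (`Person`, `PostalAddress`, `PersonEntity`, `AddressEntity`, `PhoneEntity`, `decompose`) are hypothetical illustrations, not part of COMN, E-R notation, or any modeling tool. Style A gives one entity type an attribute of composite type and an attribute of array type; Style B states the same facts with extra keyed entity types, as the comment says E-R notation requires.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Style A: composite and array types used directly as attribute types.
@dataclass(frozen=True)
class PostalAddress:
    """A composite type: a structure with no key of its own."""
    street: str
    city: str


@dataclass
class Person:
    person_id: int
    name: str
    address: PostalAddress                                   # composite-typed attribute
    phone_numbers: List[str] = field(default_factory=list)   # array-typed attribute


# Style B: the same facts expressed with additional keyed entity types.
@dataclass
class PersonEntity:
    person_id: int
    name: str


@dataclass
class AddressEntity:
    address_id: int      # surrogate key introduced only for the decomposition
    person_id: int       # reference back to the owning person
    street: str
    city: str


@dataclass
class PhoneEntity:
    person_id: int       # reference back to the owning person
    position: int        # array index made explicit as an attribute
    number: str


def decompose(person: Person, address_id: int) -> Tuple[PersonEntity, AddressEntity, List[PhoneEntity]]:
    """Map a Style A value onto the Style B entity types."""
    core = PersonEntity(person.person_id, person.name)
    addr = AddressEntity(address_id, person.person_id,
                         person.address.street, person.address.city)
    phones = [PhoneEntity(person.person_id, i, n)
              for i, n in enumerate(person.phone_numbers)]
    return core, addr, phones


ada = Person(1, "Ada", PostalAddress("1 Main St", "Springfield"),
             ["555-0100", "555-0101"])
core, addr, phones = decompose(ada, address_id=10)
```

Both styles carry the same information; the sketch only shows how many extra types and keys the second style introduces. Whether those extras belong in a logical model is the point under debate in this thread.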

Clifford Heath said: Theodore "the concept of a composite type and the concept of an array are both missing from E-R notation. These are logical concepts" - in what logic? If you're going to call something "logical", you must be able to point to a logical theory that contains them. You can't just use the word "logical" to mean "it seems sensible to me".

Theodore Hills says: I think we interpret the “logic” of a logical model a bit differently. I think your view is informed by ORM, which does in fact put emphasis on logic per se, through the provision of logical operators for illustrating predicates that express relationship types. But in both ORM and E-R, the phrase “logical data model” also evokes the notion of a model of data that is independent of physical design considerations. It is in that sense that I see arrays and composite types as necessary logical types. In my logical model let me describe repeating values of a single type as an array, and let me describe a structure without a key as a composite type. If my physical implementation requires such things to be in separate tables, then let only the physical model show those additional entities.

Clifford Heath said: Theodore "lack of a symbol for an instance/individual in E-R and ORM" - actually ORM supports singleton types (and more) by use of an object cardinality constraint "there must always be exactly one POTUS", "there are exactly four cardinal points of the compass", etc. In E/R it's done with an informal note.

This is important: the logic must not bridge meta-levels. The schema (the "theory") contains types, and rules about the allowed relationships between instances of those types. It cannot also contain actual instances. The logical statements which form the schema are about types. A given set of instances might satisfy those statements, or might not; but the instances are firmly and irrevocably separated from the types. You simply don't have a logical model if you combine these meta-levels.
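The separation described here, with a schema of types and rules on one side and a population of instances that may or may not satisfy it on the other, can be sketched in a few lines of Python. This is an illustrative assumption, not ORM syntax: `CardinalityConstraint` and `satisfies` are hypothetical names, and the populations are made-up examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class CardinalityConstraint:
    """A schema-level rule: the named type must have exactly `exactly` instances."""
    type_name: str
    exactly: int


def satisfies(population: Dict[str, Set[str]], constraint: CardinalityConstraint) -> bool:
    """Check one population (instance level) against one rule (type level)."""
    return len(population.get(constraint.type_name, set())) == constraint.exactly


# The schema mentions only types and rules, never instances.
schema: List[CardinalityConstraint] = [
    CardinalityConstraint("CardinalPoint", 4),
    CardinalityConstraint("POTUS", 1),
]

# Two candidate populations; the schema is silent about which one exists.
good = {"CardinalPoint": {"N", "E", "S", "W"}, "POTUS": {"incumbent"}}
bad = {"CardinalPoint": {"N", "E", "S"}, "POTUS": {"incumbent"}}

results_good = [satisfies(good, c) for c in schema]
results_bad = [satisfies(bad, c) for c in schema]
```

In this sketch the schema never names "N" or "incumbent"; it only constrains how many instances each type may have, which is the singleton-type approach described above.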

Theodore Hills says: Representing individuals as singleton types is unsatisfactory for the same reason: it depends on entities (namely, the singleton types) that are not necessary for complete expression, and would be unnecessary if the notation were not limited. Besides, I think you would agree that an individual and its type are two very different things. Why, then, would we choose to use a type as a stand-in for an individual?

Identifying something as an individual in an informal note means that individuals cannot be dealt with in forward engineering or reverse engineering of databases.

I know of no limitation on logical statements that they cannot reference individuals and must only reference types. From whence does this limitation arise?

Clifford Heath said: Theodore "a document that supports the nesting of documents within it. Such a composite entity has a composite type" - You're still being loose with the use of "type" here. "String" is a type - but it doesn't tell me much about what to expect - just a sequence of characters. Your "composite type" tells me that I have a set of nested boxes, but not which box is or should be inside which other box, or what else can be in a given box. Both these fit the description of "a type I use when I don't really want to use types".

"DBMSs that support storing XML documents support this, but I honestly don't know how I'd draw that" Graeme Witt and Daniel Moody designed a graphical notation which is very effective for this, and we have adopted it. They used it for showing XSD structures to domain experts. It fits very nicely into an E/R model. Here's an example (not in context of a diagram, but you'll get the gist) <https://www.dropbox.com/s/8xzizn115lcp11f/CompositeExample.png>

Theodore Hills says: When I said I don’t know how I’d draw that, I meant in standard E-R notation. In fact, your several suggestions that things can be represented with this informal notation or that one make my point that E-R notation per se cannot express these things. That’s why I believe we really do need a whole new notation.

Pugazendhi Asaimuthu said: Theodore Hills, Arrays and complex types are physical implementation choices on ER Model entities and relationships. The goal of E-R Models is to depict relationships between instances of real life entities in a normalized depiction. Normalization process eliminates repeating groups by its first stage, Normal Form 1. That prohibits arrays. Complex types are hierarchical implementation of entity relationships that are depicted very well in ER diagrams. Hope this clarifies that arrays and complex types are not a limitation of ER model. They just don't belong there.

Theodore Hills says: Pugazendhi Asaimuthu, it is unfortunately widely believed that normal form 1 prohibits arrays. Date and Darwen have shown, in Foundations for Future Database Systems: The Third Manifesto, second edition (a book I highly recommend studying), that the relational model remains intact when we allow a relation attribute to have a type of arbitrary complexity, including arrays, composite types, and even other relations. We all accept that a relation attribute type might be a character string, but a character string is an array of characters. This door, you see, has already been cracked open. No one would seriously suggest that string-typed relation attributes violate 1NF, and yet strings are array types.
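As a rough illustration of the argument above (not a proof, and not Date and Darwen's own formalism), the following Python sketch models a relation as a set of tuples in which one attribute holds an array value. The relation `people` and the helpers `restrict` and `project_name` are hypothetical names chosen for this example. The restriction and projection operators work the same way whatever the attribute's type is, and a plain string is itself an indexable sequence of characters.

```python
from typing import Callable, FrozenSet, Tuple

# Each tuple is (name, phone_numbers); the second attribute is array-typed.
Row = Tuple[str, Tuple[str, ...]]

people: FrozenSet[Row] = frozenset({
    ("Ada", ("555-0100", "555-0101")),
    ("Bob", ()),
})


def restrict(relation: FrozenSet[Row], predicate: Callable[[Row], bool]) -> FrozenSet[Row]:
    """Relational restriction: keep the tuples that satisfy the predicate."""
    return frozenset(row for row in relation if predicate(row))


def project_name(relation: FrozenSet[Row]) -> FrozenSet[Tuple[str]]:
    """Relational projection onto the name attribute; the result is again a set."""
    return frozenset((name,) for name, _ in relation)


# The operators do not care that one attribute holds an array value.
with_phone = restrict(people, lambda row: len(row[1]) > 0)
names = project_name(with_phone)

# A string attribute value is already an array of characters.
first_letter = "Ada"[0]
```

The relation still holds exactly one value per attribute per tuple; the value simply happens to be an array, just as a name happens to be an array of characters.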

Picking on "But I am not sure one can insist without a proof that no graphical notation can express all aspects of a schema--although such a notation might be difficult to use for processes and human interaction.": Are a view and its definition part of a logical schema? (A view is certainly a prominent database component that is visible to the database user, so I think that's a strong argument that views should indeed be considered part of a logical schema.) So show me what kind of notation you use to document views [and their definitions, which are algebraic expressions of arbitrary nesting] in a graphical language.
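To show what "algebraic expressions of arbitrary nesting" means in practice, here is a minimal Python sketch of a view definition as an expression tree. The node types `Base`, `Restrict`, and `Project`, the functions `evaluate` and `depth`, and the sample `person` relation are all hypothetical and cover only a tiny fragment of relational algebra. The sketch does not settle whether a graphical notation can express such trees; it only makes the structure visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, object]


@dataclass(frozen=True)
class Base:
    """Leaf node: a named base relation."""
    name: str


@dataclass(frozen=True)
class Restrict:
    """Keep the rows of `source` that satisfy `predicate`."""
    source: "Expr"
    predicate: Callable[[Row], bool]


@dataclass(frozen=True)
class Project:
    """Keep only the listed attributes of `source`, removing duplicates."""
    source: "Expr"
    attributes: Tuple[str, ...]


Expr = Union[Base, Restrict, Project]


def evaluate(expr: Expr, database: Dict[str, List[Row]]) -> List[Row]:
    """Evaluate an expression tree against a database of base relations."""
    if isinstance(expr, Base):
        return list(database[expr.name])
    if isinstance(expr, Restrict):
        return [r for r in evaluate(expr.source, database) if expr.predicate(r)]
    if isinstance(expr, Project):
        seen, out = set(), []
        for r in evaluate(expr.source, database):
            key = tuple(r[a] for a in expr.attributes)
            if key not in seen:          # projection yields a set of rows
                seen.add(key)
                out.append(dict(zip(expr.attributes, key)))
        return out
    raise TypeError(f"unknown expression node: {expr!r}")


def depth(expr: Expr) -> int:
    """Nesting depth of the expression tree."""
    return 1 if isinstance(expr, Base) else 1 + depth(expr.source)


db = {"person": [{"name": "Ada", "age": 36},
                 {"name": "Tim", "age": 9},
                 {"name": "Ada", "age": 40}]}

# A view definition: names of adults, nested three levels deep.
adult_names = Project(Restrict(Base("person"), lambda r: r["age"] >= 18), ("name",))
view_rows = evaluate(adult_names, db)
```

Each additional operator adds one more level to the tree, so any notation for views has to cope with unbounded nesting.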

"informed by ORM, which does in fact put emphasis on logic" - A logical model exists to support logic - statements that can be evaluated for truth. That's what "logical" means. ER is also based in logic. It's perfectly possible to express such statements about the structures that might be implemented as arrays without explicit support for arrays - and in fact such statements are clearer. For example, can an array have missing elements? Is the first element numbered 0, or 1 (or can there be negative-numbered elements)? A decidable logic must answer those things, not leave them to tacit assumptions.

Theodore "my point that E-R notation per se cannot express these things. That’s why I believe we really do need a whole new notation". A new notation perhaps, but a "whole new" notation? You've given no reason for that. My issue is this: a richer notation requires more types of symbol, and every new type of symbol increases the difficulty of learning and interpreting the diagram. While it's true that an annotation is also a symbol, annotations are visibly secondary and so don't obscure the overall structure. COMN could have four different kinds of boxes all representing "Person" (or a person). How is a novice expected to understand that?

1. Algebraically, domains/types are fully orthogonal to relations. This means a relation attribute type can be anything, including relations. In theory it could even be of a certain (logical) schema, or even ANY schema. However, as Chris Date has often noted, not all domain types make sense. Some only make sense in derivation and manipulation (e.g. RVAs), while others make sense in schema design (algebraic domain types), and still others don't make sense at all (but are not prohibited by the theory). Atomicity is an aspect of the logical model, not the domain types. Physical atomicity (e.g. predefined/hardcoded domain types) can be defined however, but without an algebra representing the physical level of representation this discussion will be moot.

2. Notations, unless algebraic, are always lacking and limited, and can never be the goal or the judge of the expressiveness of a certain level of representation (e.g. logical). Notations support processes and human interaction with a schema, and hence are only supportive. Typical notions such as "there should only be ONE type of notation for any schema", "a (graphical) notation should be able to express all aspects of a schema", and "notation shapes are a primary aspect of any notation" are very common but incorrect.

3. Understanding what drives each level of representation (the concerns of that level), its formal makeup (algebra), and its model transformation (algebraic transformation) to another level of representation is far more important, and should be the starting point for any discussion.

To be continued..

I'm not sure what else to say. I don't think you've effectively countered any of the arguments I put forward.
