Reverse Engineering an Undocumented Database

One of the most challenging problems you may face as a data professional is documenting a legacy, undocumented database. There are techniques and products that will make this job easier.

Whether you are re-architecting an existing application, building a data warehouse from legacy data sources, or just need to orient yourself to a domain of data, having a solid data model and dictionary of your data sources is indispensable. If you are lucky, those artifacts exist, but if not, what steps can you take to go about constructing them?

The early steps don’t necessarily occur in order, so start where you can and you might be able to circle back.

Step 1: Talk to Your Data Architects

If you have enterprise data architects, start by talking to them about your requirements. They are most likely to have the big picture of the data in your organization. They may have an enterprise-wide data dictionary, conceptual models of the data, an inventory of data sources, applications, and interfaces, and requirements for any future designs you are tasked with.

Step 2: Assemble Existing Documentation

Suppose there is no data model. Are there application documents, such as requirements, design, processes, or screen shots? What about help documents, FAQs, or paper forms (yes those still exist!)? What are the input and output interfaces? Scheduled jobs? Stored procedures? Stored comments attached to physical database objects? Naming conventions and abbreviations? Lookup table contents? Gather these and read them. Take notes as you go, making a list of entities, relationships, and questions.

Step 3: Reverse Engineer the Database into a Modeling Tool

I’ll assume you are working with a relational database accessible via ODBC, JDBC, or a native driver. For Microsoft SQL Server, the database diagrams feature of Enterprise Manager can be used as a rudimentary modeling tool, and specialized tools exist such as Erwin, Red Gate Dependency Tracker, Enterprise Data Architect, and ER/Studio. My favorite is Erwin, but it runs only on Windows and is quite expensive. The best free tool I have found is SQL Power Architect, but only for simple pictures, not full-fledged reverse engineering.

Be sure to work with a copy of the database, not the real thing, as you will later be making schema changes to assist in this effort. A populated database is even better, but even an empty one is useful.

If the database in question has schemas, they are good candidates for subject areas, but proceed with caution: schemas may reflect application architecture rather than data architecture.

Naming conventions may also provide a clue to how data can be grouped into subject areas.
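As a rough sketch of how a naming convention can seed subject areas, the helper below groups table names by a leading prefix token. (The separator and the sample table names are assumptions for illustration; adapt them to whatever convention your database actually uses.)

```python
from collections import defaultdict

def group_by_prefix(table_names, separator="_"):
    """Group table names by their leading prefix token.

    Tables named with a convention like ORD_Header, ORD_Line,
    INV_Item fall naturally into ORD and INV groups, which are
    candidate subject areas.
    """
    groups = defaultdict(list)
    for name in table_names:
        prefix = name.split(separator, 1)[0] if separator in name else "(none)"
        groups[prefix].append(name)
    return dict(groups)

tables = ["ORD_Header", "ORD_Line", "INV_Item", "INV_Warehouse", "Customer"]
print(group_by_prefix(tables))
# {'ORD': ['ORD_Header', 'ORD_Line'], 'INV': ['INV_Item', 'INV_Warehouse'], '(none)': ['Customer']}
```

Tables that fall into the "(none)" bucket are the ones you will have to place by hand.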

Step 4: Analyze Foreign Keys

If the database makes use of foreign keys, the reverse-engineered model will reflect them. If you are working with a database that doesn’t use foreign keys, make the effort to create them. There are COTS tools that can help with this, such as Optim Data Discovery. You can also roll your own by querying INFORMATION_SCHEMA, looking for common field names in the metadata.
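A minimal sketch of that metadata scan, assuming you have already pulled table and column names from INFORMATION_SCHEMA.COLUMNS into a plain dictionary. The matching rule here (a column named `<OtherTable>ID`) is just one common convention; substitute whatever pattern your database follows:

```python
def find_fk_candidates(columns_by_table):
    """Suggest foreign keys by matching column names of the form
    <ReferencedTable>ID against the set of known table names.

    columns_by_table: dict of table name -> list of column names,
    e.g. as loaded from INFORMATION_SCHEMA.COLUMNS.
    Returns (child_table, column, parent_table) triples.
    """
    tables = {t.lower(): t for t in columns_by_table}
    candidates = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if col.lower().endswith("id"):
                ref = col[:-2].lower()          # strip the ID suffix
                if ref in tables and tables[ref] != table:
                    candidates.append((table, col, tables[ref]))
    return candidates

schema = {
    "Customer": ["CustomerID", "Name"],
    "Orders":   ["OrderID", "CustomerID", "OrderDate"],
}
print(find_fk_candidates(schema))
# [('Orders', 'CustomerID', 'Customer')]
```

Every triple it returns is only a candidate; the data itself has the final say, which is where the next step comes in.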

If the database is populated, you can test your assumptions by creating a foreign key for a suspected relationship. Keep a spreadsheet of your attempts, noting the source and destination table/key pairs. If the foreign key fails due to constraint violations, query the data to see how many violations occur and add that count to the spreadsheet. If the failure rate is low, you probably have orphans or out-of-date reference tables, but these are still good candidates. Having an empty copy of the database allows you to create the keys anyway, so they appear in your reverse-engineered model.

Step 5: Analyze the Default Diagram

When you reverse engineer the database, by default you will see all tables in a single view or diagram. You may have a model in which almost everything links to a few key entities. If the number of tables is large, the result is a dense hub-and-spoke diagram that is almost useless.

To make the model more digestible, start by selectively removing the most linked-to entities so that smaller subsets of relationships will be visible. For example, suppose almost everything links to a table called Customer. In the zoomed-out view of your model, all the less frequently occurring relationships will be obscured by the dominance of all those links to Customer. If you remove the Customer table from the model and use the re-draw feature of your tool, the diagram will adjust so that smaller subsets of relationships can be identified. Another common offender is a CreatedByUserID link from every table in the database to a User table.
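Finding those hub tables doesn’t have to be done by eye. As a sketch, assuming you have a list of (child, parent) pairs for each foreign key (from the model, or from your candidate spreadsheet), ranking parents by in-degree surfaces the entities worth suppressing:

```python
from collections import Counter

def hub_tables(foreign_keys, threshold=2):
    """Rank parent tables by how many foreign keys point at them.

    foreign_keys: list of (child_table, parent_table) pairs.
    Tables at or above the threshold are candidates to remove
    temporarily while untangling the default diagram.
    """
    in_degree = Counter(parent for _, parent in foreign_keys)
    return [t for t, n in in_degree.most_common() if n >= threshold]

fks = [("Orders", "Customer"), ("Invoice", "Customer"),
       ("Complaint", "Customer"), ("Orders", "AppUser"),
       ("Invoice", "AppUser"), ("OrderLine", "Orders")]
print(hub_tables(fks))
# ['Customer', 'AppUser']
```

In a real database the threshold would be much higher than 2; the point is to rank, remove, redraw, and repeat.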

Step 6: Iteratively Recreate the Model

Repeat steps 4 and 5 until the default diagram can be partitioned into meaningful subsets. These subsets become your subject areas. Analyze the actual data as necessary to assist in this analysis. When iterating, always save both your model and your physical database first, in case you overwrite one or the other.

If you prefer making model changes in DDL, start with the latest model and forward engineer it into the physical database. Apply your DDL statements to the physical database, then reverse engineer it again. Be careful to bring in changes only, or you will lose all the hard work you did in the model; for example, the diagrams you create in step 8 would be lost.

Why would DDL be preferable? I like it when I have a large number of trivial changes to make that can be easily scripted from metadata.

Step 7: Add Back Foreign Keys

When done iterating, selectively add back any foreign keys you had to suppress to do your analysis. For analysis purposes, while I wouldn’t add back the ubiquitous CreatedByUserID, I would certainly want to have Customer relationships in the model.

Later, when creating diagrams, some tools will allow you to hide relationships on a particular diagram view if they only add clutter.

Step 8: Create Subject Area Diagrams

Now that you have grouped entities into subject areas, it’s time to create diagrams for each. My rule of thumb is that any model or model subset should be readable without a microscope on an 11x17 sheet of paper. How many entities fit at this scale depends on the display preferences for diagram entities. In Erwin, you can display the table name only, the primary key, all keys (foreign and alternate), or all attributes, and you can choose whether to display datatypes, nullability, and comments.

If I am trying to show a conceptual view, I choose table name only, and I can include a lot of entities on a single piece of paper. For a specific subject area, I usually show all attributes (names only). Since reference or lookup tables often have just ID, Name, and Description, those attributes aren’t particularly informative, so I minimize those entities to show the table name only.

Similarly, if all tables have the usual housekeeping timestamps, I remove all those attributes from the database before I reverse engineer it, because on the detailed diagrams, that is just clutter.
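Stripping those columns is easy to script from the metadata. A hedged sketch (the housekeeping column names are assumptions; list whatever your shop’s audit convention actually uses, and of course run the output only against your working copy):

```python
# Hypothetical audit-column names; substitute your own convention.
HOUSEKEEPING = {"CreatedDate", "CreatedBy", "ModifiedDate", "ModifiedBy"}

def drop_housekeeping_ddl(columns_by_table):
    """Generate ALTER TABLE statements that strip audit columns
    from the copy of the database before reverse engineering.

    columns_by_table: dict of table name -> list of column names,
    e.g. as loaded from INFORMATION_SCHEMA.COLUMNS.
    """
    ddl = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if col in HOUSEKEEPING:
                ddl.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    return ddl

schema = {"Customer": ["CustomerID", "Name", "CreatedDate", "ModifiedBy"]}
for stmt in drop_housekeeping_ddl(schema):
    print(stmt)
# ALTER TABLE Customer DROP COLUMN CreatedDate;
# ALTER TABLE Customer DROP COLUMN ModifiedBy;
```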

If a subject area is large, I create multiple diagrams for it. For example, in an electronic medical records database, I would probably have a subject area called Patient, with subsets like Diagnosis, Medical History, and Medications. If these are not subject areas themselves, create specific diagrams for them within the Patient subject area.

Step 9: Color Code Entities

Color coding provides a visual cue of the entity type. These are not domain-specific groupings - that is what subject areas and specific diagrams are for. Instead, they are more general. I use Reference (for lookup tables), Main Entity (for core business entities), Business Data (for supporting facts and relationships about main entities), Event (for dynamic/behavioral data), Processing and Control (for workflow and other application data), Document (for physical and electronic forms, letters, etc.), Roles and Permissions, Communication, and others if appropriate.

Use enterprise architecture guidelines if available.

Limit yourself to ten or fewer categories to avoid information overload, and create a legend for your codes.

Step 10: Add Definitions

If you have an existing dictionary, import it into the model. Depending on the format, there are a variety of ways to do this. Comments that are instantiated in the database will be imported automatically during reverse engineering. If not, you can format your comments in Excel by table/column, and with tool support these can often be imported into the model. Some tools let you program against the metamodel or import a spreadsheet directly. You can always generate SQL statements from your spreadsheet to attach the comments to the physical database, then reverse engineer again to pull them in. (Again - don't forget to import changes only.)

In SQL Server, the system stored procedure sp_addextendedproperty attaches comments to an object (in this case, a table or column). You can use a spreadsheet formula to generate the SQL, or import the spreadsheet and write a SQL script to do it.
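A sketch of the generation step, assuming the dictionary was exported as a CSV with table, column, and comment headings (the sample rows and the dbo schema are assumptions; MS_Description is the property name most tools read comments from):

```python
import csv, io

# Hypothetical dictionary spreadsheet exported as CSV.
# An empty column field means the comment applies to the table itself.
DICTIONARY_CSV = """table,column,comment
Customer,,A person or organization that places orders
Customer,CustomerID,Surrogate primary key
"""

def extended_property_sql(row):
    """Emit a SQL Server sp_addextendedproperty call for one row."""
    comment = row["comment"].replace("'", "''")   # escape embedded quotes
    stmt = (f"EXEC sp_addextendedproperty 'MS_Description', N'{comment}', "
            f"'SCHEMA', 'dbo', 'TABLE', '{row['table']}'")
    if row["column"]:
        stmt += f", 'COLUMN', '{row['column']}'"
    return stmt + ";"

for row in csv.DictReader(io.StringIO(DICTIONARY_CSV)):
    print(extended_property_sql(row))
```

Run the generated script against the working copy, then reverse engineer again to pull the comments into the model.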

If no dictionary exists, make at least a minimal effort to create definitions at the table level. If there is time, go to the attribute level, using your gathered artifacts to assist. Talk to users, business SMEs, the DBA, programmers, analysts - whoever is available. In fact, if these experts are available, talk to them much earlier in the process as you conduct your analysis.

Of course, you don’t have to wait to add definitions. Collect and record them as you make discoveries along the way.

Additional Hints

I often rename objects for clarity, unless the model is destined to be the official data model. In that case, I make changes only in the logical model, or depending on the purpose of the model, I might not do this at all.

I think Hungarian notation is useless, so if my tables are all called TBLThing, I rename them to Thing. I correct misspellings to assist in future metadata analysis. I add role names to foreign keys where it adds clarity. For example, if a customer order has AddressID1 and AddressID2, for shipping and billing, I would rename them ShippingAddressID and BillingAddressID.
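The prefix cleanup is another chore worth scripting. A small sketch that generates SQL Server sp_rename calls (other engines use ALTER TABLE ... RENAME; the TBL prefix and sample names are assumptions):

```python
def cleanup_renames(table_names, prefix="TBL"):
    """Generate sp_rename calls that strip a Hungarian-style prefix
    from table names. Tables without the prefix are left alone."""
    stmts = []
    for name in table_names:
        if name.startswith(prefix) and len(name) > len(prefix):
            stmts.append(f"EXEC sp_rename '{name}', '{name[len(prefix):]}';")
    return stmts

print(cleanup_renames(["TBLCustomer", "TBLOrder", "Invoice"]))
# ["EXEC sp_rename 'TBLCustomer', 'Customer';", "EXEC sp_rename 'TBLOrder', 'Order';"]
```

As with the other generated DDL, apply this only to the working copy, never to the official model unless renaming is in scope.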

I also create a subject area model. By this, I mean a model where the “entity” is a subject area and the relationships are between subject areas. This is easy to do in Visio, and it serves as a very high-level conceptual model.

I sometimes create additional documentation, such as DFDs, System Context Diagrams, and State Diagrams.

If your objective is to populate a warehouse or mart, you might also consider adding data lineage. Some tools allow you to document this in the model itself. Questions to consider are:

  • What is the source of the data?
  • How and when is it populated?
  • What transformations or calculations get applied to the data?
  • What is the granularity of change data capture? All DML, nightly, weekly, or monthly snapshots?


How much effort you spend on these analysis tasks depends on your constraints and requirements, of course, but if you really need a fully documented model, these steps will help you get there.

