Considerations when using fixed width flat files as a source from Informatica PowerCenter.

Considerations when using fixed width flat files as a source from Informatica PowerCenter.

One of the main advantages of using an ETL software such as Informatica PowerCenter (IPC) is their ability to use multiple data sources from different contexts, ranging from flat files (comma-separated values, tab-separated values, character-separated value) to multiple Databases Management Systems (DBMS) such as Oracle, DB2, Teradata or MySQL, and combine them to many targets. 

When it comes to using a fixed width flat file as a source from IPC, the task unlike what it might look like becomes nontrivial. 

The IPC server takes into consideration the end of line control character. The End Of Line (EOL) control character is a sequence of characters designed to specify the end of each line within a flat file (every time you hit the ENTER key). As every nonsense fight between tech standards, everyone has their definition of correct. 

Linux/Unix uses the LineFeed (LF) character (or as an Escape sequence: \n) as an EOL character, instead, Windows uses a combination of two EOL characters, LineFeed (LF) plus Carriage Return (CR) (or as an Escape sequence: \r\n).

When the IPC server READS a fixed width flat file it reaches the EOL character to determine when a row ends and starts a new one, therefore the number of bytes between each record should be specified by the user as 1 (\n) in the Unix case and 2 (\r\n) in the Windows case, this should be declared in the "Number of bytes to skip between records" option within the "Fixed width properties" inside the Workflow (explained later in the post). Furthermore and most important, by default every time the IPC server examines any fixed width flat file it reads it as "Not line sequential" (records which don't have an EOL character).

An incorrect interpretation of the number of bytes required between each row at the source leads to losing information at the target. To demonstrate and solve these issues the following example uses two fixed width flat files, one ending with a Windows EOL characters (\r\n) called: EOL_Windows_file.txt, and other ending with a Linux/Unix EOL character (\n) called: EOL_Unix_file.txt.

Unix has the $file command which comes handy to determine the codification of any flat file quickly. For Unix files, you can read the "ASCII text" description, while for Windows files it'll show it as: "ASCII text, with CRLF line terminators", remember that CR+LF means \r\n as an escape sequence. 

Both Windows and Unix fixed width flat files have the same content, two unique rows, composed by the columns Name (10 characters), Last Name (15 characters), Address (25 characters) and Year (10 characters) as shown below.

Besides the $file command Unix also has the $od command, short for an octal dump, which enables the view of the EOL characters of both files. The figure below shows the output of the $od command for the Unix file, and how it was expected shows only the EOL character "\n" highlighted in blue with the octal number: 012.

For the Windows file, the $od command shows the EOL characters: \r\n highlighted in blue, with the octal numbers 015 and 012 respectively. 

With the objective to show the differences between the two kinds of EOL characters from Unix and Windows and the default EOL IPC server configuration we import the EOL_Windows_file.txt and EOL_Unix_file.txt file to the Source/Target Analyzer in the Informatica PowerCenter Mapping Designer and build a mapping/workflow for each one. 

The mapping with the Windows EOL characters it's called: m_Windows_fixed_width_flat_file, this mapping takes the 2 records from the EOL_Windows_file.txt file and writes it into another file called: Output_EOL_Windows_file.txt, with the same fixed width column configuration in the same path as the source file. The mapping is shown in the figure below. 

Hence the workflow wf_Windows_fixed_width_flat_file for the mapping: m_Windows_fixed_width_flat_fil, its created with the default configurations, only being modified the "Source File directory" path within the "Mapping" tab from the Edit Task properties; the workflow: wf_Windows_fixed_width_flat_file, its show in the figure below.

After the workflow: wf_Windows_fixed_width_flat_file its run, the Workflow monitor shows a session Status equals to Succeeded for the mapping: m_Windows_fixed_width_flat_file and no error is encountered. However, analyzing the Source/Target Statistics, we can see that only 1 of the 2 rows are read from the "EOL_Windows_file.txt" file

Exploring the Output_EOL_Windows_file.txt file we can see that its created with an EOL Windows character configuration due to the CRLF line terminators underlined by an orange line in the figure below and the Escape Sequence \r\n in the orange box. Nevertheless and most important, the columns Last Name, Address and Year are missing from the second record! This could lead to severe information loss. 

Why is this happening? Why the second row is not entirely ignored and why we only have the Name column? Well, first of all, by default the IPC server reads each record in a Not line sequential way (records do not have a row delimiter), so after reaching the number of bytes for the first record it takes (again by default) just 1 byte as a separator to start a new record. Our first record in the Year column of the EOL_Windows_file.txt file was declared with 10 bytes fixed width, so even if it has the EOL characters in the 5 and 6 bytes, it really ends at the end of the 10 byte, with the letter: "r" in the word: "Pedro"; plus 1 byte as a separator.  

So, ok, by default each record its read in a not line sequential way, How can I fix this? To be honest, it's pretty simple, it's diminished to one click; the real problem is/was where I should click? From the Workflow, make two clicks on the session and open the "Edit Tasks window", then make click on the "Sources" folder and in the properties section click on "Set File Properties."

It will open the "Flat File - Sources" window, click on "Advanced" and it'll open the "Fixed Width Properties - Sources" window, finally click on the box "Line sequential file format" option; automatically it'll uncheck the "Number of bytes to skip between records" and the IPC server will rely only on the EOL characters to distinguish between one record and another. Don't forget to click on apply in the Edit Task window. 

After doing the changes in the fixed width file properties and rerunning the workflow: wf_Windows_fixed_width_flat_file, the Workflow Monitor shows this time the 2 records from the EOL_Windows_file.txt file, as it should be in the beginning. 

When reviewing the last generated Output_EOL_Windows_file.txt, we can see that this time both records from the EOL_Windows_file.txt appear intact and also both EOL characters for each record.  

However, what happens with the source EOL_Unix_file.txt and the output Output_EOL_Unix_file.txt file after being generated by their Workflow: wf_Unix_fixed_width_flat_file? Just the same thing, the second row it's cut after the word "Pedro" missing information, however, funny thing, the output file has the same EOL characters CRLF as a Windows file, even if the source was a Unix file. 

 Also the solution it's pretty simple for the Workflow: wf_Unix_fixed_width_flat_file and it's the same for the Workflow: wf_Windows_fixed_width_flat_file, clicking on the box "Line sequential file format".

Final thoughts, Linux and Windows have two different End Of Line (EOL) characters, Linux has just one character (LF) and Windows has the character (LF) plus (CR) becoming: CRLF. The IPC server by default reads a fixed width flat file in a Not Line sequential way, taking just 1 byte between each record; if your source file has an EOL character, in order to be read correctly from the IPC server it should be enabled the "Line sequential file format" option within the session properties from the Workflow designer. 

Thanks for reading!

To view or add a comment, sign in

More articles by Óscar Azeem B.

Others also viewed

Explore content categories