A Python Script to Eliminate Duplicate Entries in TMX Files

TMX, or Translation Memory eXchange, is a standard XML file format used in the translation industry to store and exchange translation memories. Each entry in a translation memory is stored as a translation unit. Nowadays TMX files are often hosted in the cloud as well, so several people can work on the same file and make changes asynchronously.

This raises an issue: when changes are not properly synced, a TMX file can end up with multiple translation units for the same source text.

Today, I’m going to walk you through how I created a script that cleans up duplicated entries in TMX files with XPath.

Snippet of the translation unit in the example TMX:

    <tu creationdate="20220920T061441Z" creationid="SISI\sisil" changedate="20220920T061441Z" changeid="SISI\sisil" lastusagedate="20220920T061441Z">
      <prop type="x-ID">134892402</prop>
      <tuv xml:lang="en-US">
        <seg>How to eliminate duplicate entries in TMX Files with XPath?</seg>
      </tuv>
      <tuv xml:lang="zh-CN">
        <seg>如何使用 XPath 消除 TMX 文件中的重复条目？</seg>
      </tuv>
    </tu>

Key Components of the translation unit <tu>:

Key Attributes of <tu>

creationdate: The date and time when the translation unit was created
changedate: The date and time when the translation unit was last changed

<prop type="x-ID">: A property of the translation unit, which can hold metadata related to the TU. Here, type="x-ID" is a custom property used to store an identifier. In this case, duplicated TUs have the same x-ID value. The <prop> element allows tools to insert non-standard information into a TMX document.
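To make these components concrete, here is a minimal sketch of reading them from a single translation unit with lxml. The sample unit is a shortened, hypothetical version of the snippet above, and `describe_tu` and `XML_LANG` are illustrative names, not part of my script.

```python
from datetime import datetime
from lxml import etree

# Hypothetical one-unit sample, shaped like the snippet above.
SAMPLE_TU = """
<tu creationdate="20220920T061441Z" changedate="20220920T061441Z">
  <prop type="x-ID">134892402</prop>
  <tuv xml:lang="en-US"><seg>Hello</seg></tuv>
  <tuv xml:lang="zh-CN"><seg>你好</seg></tuv>
</tu>
"""

# xml:lang lives in the reserved XML namespace, so lxml exposes it
# under this fully qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def describe_tu(tu):
    """Return the x-ID, parsed change date, and segments keyed by language."""
    x_ids = tu.xpath("prop[@type='x-ID']/text()")
    changed = datetime.strptime(tu.get("changedate"), "%Y%m%dT%H%M%SZ")
    segments = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    return (x_ids[0] if x_ids else None), changed, segments

tu = etree.fromstring(SAMPLE_TU)
x_id, changed, segments = describe_tu(tu)
print(x_id, changed.isoformat(), segments)
```

The date format string `%Y%m%dT%H%M%SZ` is the same one the script later uses to sort units by changedate.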

TMX Deduplication Process:

1. Find translation units with duplicate x-IDs.

2. For translation units that share an x-ID, remove the oldest ones until only two remain. The remaining two should be the ones with the latest changedate.

3. Move all the removed entries to a backup file.
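Before touching any XML, the three steps above can be sketched on plain (x-ID, changedate) pairs. This is only an illustration of the logic under the assumption that each unit is reduced to those two values; `plan_dedup`, its `keep` parameter, and the sample records are hypothetical and do not appear in my script.

```python
from collections import defaultdict
from datetime import datetime

def plan_dedup(records, keep=2):
    """Split (x_id, changedate) records into kept and removed lists.

    For each x-ID that occurs more than `keep` times, only the `keep`
    most recently changed records are kept; the older ones are removed.
    """
    groups = defaultdict(list)
    for x_id, changedate in records:
        groups[x_id].append((x_id, changedate))

    kept, removed = [], []
    for group in groups.values():
        # Oldest first, so the newest records sit at the end of the list.
        group.sort(key=lambda r: datetime.strptime(r[1], "%Y%m%dT%H%M%SZ"))
        cut = max(len(group) - keep, 0)
        removed.extend(group[:cut])
        kept.extend(group[cut:])
    return kept, removed

# Hypothetical records: x-ID "A" appears three times, "B" once.
records = [
    ("A", "20220920T061441Z"),
    ("A", "20220101T000000Z"),
    ("B", "20220301T120000Z"),
    ("A", "20220615T093000Z"),
]
kept, removed = plan_dedup(records)
print(removed)
```

In this sample only the oldest "A" record ends up in `removed`, which corresponds to the entry that would be moved to the backup file.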

Three key tools used for the script:

1. lxml for XML Parsing: lxml is the only Python library I found that works well for parsing TMX files. There are other libraries you can look into, such as Beautiful Soup.

2. XPath for XML Queries: XPath selects targeted nodes in an XML document. I used XPath to find translation units by a unique identifier (in this case, x-ID).

3. XML Data Management with ElementTree: I used lxml's ElementTree API (lxml.etree) to read, modify, and write the XML files.
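As a small illustration of how lxml and XPath work together, the sketch below selects translation units by x-ID. It passes the ID as an XPath variable (`$xid`), which lxml binds from a keyword argument, so IDs containing quotes do not break the expression. This is an alternative to building the expression with string formatting, not what my script does; the sample document and `find_by_x_id` are hypothetical.

```python
from lxml import etree

# Hypothetical two-unit TMX body; the second x-ID contains an apostrophe.
SAMPLE = """
<tmx version="1.4"><body>
  <tu><prop type="x-ID">1001</prop></tu>
  <tu><prop type="x-ID">o'neil-7</prop></tu>
</body></tmx>
"""

root = etree.fromstring(SAMPLE)

def find_by_x_id(root, x_id):
    """Return all <tu> elements whose x-ID property equals x_id."""
    # $xid is an XPath variable; lxml binds it from the keyword argument.
    return root.xpath("//tu[prop[@type='x-ID' and text()=$xid]]", xid=x_id)

print(len(find_by_x_id(root, "1001")))
print(len(find_by_x_id(root, "o'neil-7")))
```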

Key Parts of My Script:

1. Importing Necessary Libraries

    from lxml import etree
    import datetime

Import the XML processing library lxml and the datetime module for handling dates and times.

2. Function to Retrieve File Paths from User

    def get_file_paths():
        tmx_file_path = input("Enter the path of the TMX file: ")
        output_file_path = input("Enter the path for the output file: ")
        return tmx_file_path, output_file_path

A function that prompts the user to enter the file paths for both the source TMX file and the desired output file, and returns these paths.

3. Open and Parse the TMX File

    parser = etree.XMLParser(ns_clean=True)
    tree = etree.parse(tmx_file_path, parser)
    root = tree.getroot()

Initialize an XML parser and parse the TMX file, obtaining the root of the XML tree.

4. Count Initial Translation Units

    initial_count = len(root.findall('.//tu'))
    print("Initial unit count: ", initial_count)

Count and print the number of translation units in the TMX file.

5. Identify and Count Duplicate x-IDs

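    x_id_count = {}
    backup_list = []  # removed translation units are collected here
    for tu in root.findall(".//tu"):
        x_id_elements = tu.xpath(".//prop[@type='x-ID']/text()")
        if x_id_elements:
            x_id = x_id_elements[0]
            if x_id in x_id_count:
                x_id_count[x_id] += 1
            else:
                x_id_count[x_id] = 1
    # The next step iterates over duplicated_x_id_list; it is assumed to hold
    # every x-ID that occurs more than once.
    duplicated_x_id_list = [x_id for x_id, count in x_id_count.items() if count > 1]

Create a dictionary to count occurrences of each x-ID, prepare a list for old translation units, and collect the x-IDs that occur more than once.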
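The same counting can be written more compactly with `collections.Counter`. The sketch below is an alternative under one assumption: it counts every x-ID property in the file with a single XPath call, whereas the loop above counts only the first x-ID of each unit. The sample document is hypothetical.

```python
from collections import Counter
from lxml import etree

# Hypothetical TMX body: x-ID "1" occurs three times, "2" once.
SAMPLE = """
<tmx version="1.4"><body>
  <tu><prop type="x-ID">1</prop></tu>
  <tu><prop type="x-ID">2</prop></tu>
  <tu><prop type="x-ID">1</prop></tu>
  <tu><prop type="x-ID">1</prop></tu>
</body></tmx>
"""

root = etree.fromstring(SAMPLE)

# One XPath call collects the text of every x-ID property in the file.
x_id_count = Counter(root.xpath(".//tu/prop[@type='x-ID']/text()"))

# IDs seen more than once are the candidates for deduplication.
duplicated_x_id_list = [x_id for x_id, n in x_id_count.items() if n > 1]
print(duplicated_x_id_list)
```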

6. Handle Duplicates

    for stringId in duplicated_x_id_list:
        print(f"Extracting entries with duplicate xid {stringId}...")
        # Select every <tu> whose x-ID property matches this ID.
        xPathSelect = f"//tu[prop[@type='x-ID' and text()='{stringId}']]"
        elementsDup = root.xpath(xPathSelect)
        print(f"There are {len(elementsDup)} elements in this duplicate container with xid {stringId}")
        # Sort oldest first so that pop(0) always removes the oldest unit.
        elementsDup.sort(key=lambda e: datetime.datetime.strptime(e.get('changedate'), "%Y%m%dT%H%M%SZ"))
        if len(elementsDup) > 2:
            print(f"For xid {stringId}, more than 2 entries are found. Process initiated to remove until there are only two.")
            while len(elementsDup) > 2:
                elementToBeRemoved = elementsDup.pop(0)
                # Detach the unit from the tree and keep it for the backup file.
                elementToBeRemoved.getparent().remove(elementToBeRemoved)
                backup_list.append(elementToBeRemoved)

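Iterate Over Duplicate IDs: The loop visits every x-ID that occurs more than once.
Print Duplicate Information: For each duplicate ID, the code prints a progress message.
Select, Fetch and Count Duplicates: An XPath query returns all translation units that share the ID, and their number is printed.
Sort Duplicates by Change Date: The units are sorted so that the oldest comes first.
Remove Old Duplicates and Update the Parent XML and Backup List: While more than two units remain, the oldest one is removed from its parent in the XML tree and appended to the backup list.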
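One thing to watch in the sort step: if a unit has no changedate attribute, `e.get('changedate')` returns None and `strptime` raises a TypeError. The sketch below shows one possible fallback, which is an assumption on my part rather than something the script does: use creationdate when changedate is missing, and treat undated units as the oldest. `change_sort_key` and the sample units are hypothetical.

```python
from datetime import datetime
from lxml import etree

TMX_DATE_FORMAT = "%Y%m%dT%H%M%SZ"

def change_sort_key(tu):
    """Sort key: changedate, else creationdate, else the earliest datetime."""
    stamp = tu.get("changedate") or tu.get("creationdate")
    if stamp is None:
        # Assumption: undated units count as oldest, so they are removed first.
        return datetime.min
    return datetime.strptime(stamp, TMX_DATE_FORMAT)

# Hypothetical units: one fully dated, one with only creationdate, one undated.
units = [
    etree.fromstring('<tu changedate="20220920T061441Z"/>'),
    etree.fromstring('<tu creationdate="20210101T000000Z"/>'),
    etree.fromstring('<tu/>'),
]
units.sort(key=change_sort_key)
print([dict(u.attrib) for u in units])
```

With this key, the undated unit sorts first, followed by the unit that only has a creationdate, and the most recently changed unit comes last.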

7. Write Backup and Updated TMX Files

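    # Assumption: backup_root is not defined in the excerpts shown here; a bare
    # <tmx><body> wrapper around the removed units stands in for it.
    backup_root = etree.Element('tmx')
    etree.SubElement(backup_root, 'body').extend(backup_list)
    backup_tree = etree.ElementTree(backup_root)
    backup_tree.write(output_file_path + 'backup.tmx', pretty_print=True, xml_declaration=True, encoding='UTF-8')
    original_tree = etree.ElementTree(root)
    original_tree.write(output_file_path + 'updated.tmx', pretty_print=True, xml_declaration=True, encoding='UTF-8')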

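Save the backup and updated TMX files. Note that output_file_path is used as a prefix: 'backup.tmx' and 'updated.tmx' are appended to it directly, so include a trailing path separator if it points to a folder.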
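A quick way to check the written file is to parse it again and count its translation units. The sketch below does this in a temporary directory and builds the path with `os.path.join`, which adds the separator for you. `write_tmx` and the one-unit sample tree are hypothetical stand-ins, not part of my script.

```python
import os
import tempfile
from lxml import etree

def write_tmx(root, path):
    """Write an element tree to path as UTF-8 XML with a declaration."""
    etree.ElementTree(root).write(
        path, pretty_print=True, xml_declaration=True, encoding="UTF-8"
    )

# Hypothetical stand-in for a deduplicated tree.
root = etree.fromstring(
    '<tmx version="1.4"><body><tu><prop type="x-ID">1</prop></tu></body></tmx>'
)

with tempfile.TemporaryDirectory() as out_dir:
    # os.path.join inserts the path separator that plain "+" would omit.
    updated_path = os.path.join(out_dir, "updated.tmx")
    write_tmx(root, updated_path)

    # Re-parse the written file to confirm it is well-formed and complete.
    reread_count = len(etree.parse(updated_path).getroot().findall(".//tu"))

print(reread_count)
```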

For my complete code, please visit my GitHub page. You are more than welcome to check out other fun solutions I created for localization engineering issues and share your thoughts with me!

A heartfelt thank you to my mentor, Shelton Chen, and my professor, Max Troyer, for inspiring and assisting me with this TMX file handling task.