Coding Challenge #118 - Comm

Coding Challenge #118 - Comm

This challenge is to build your own version of comm, the classic Unix utility that compares two sorted files line by line.

comm is one of those small tools in the Unix toolbox that solves a single problem really well. Given two sorted files, it tells you three things at once: which lines are unique to the first file, which lines are unique to the second file, and which lines appear in both. It does this in a single streaming pass, taking advantage of the fact that the inputs are already sorted, so it never has to load whole files into memory and never has to do an O(n²) comparison.

You'll find comm used to diff lists of users, find files that exist in one directory tree but not another, work out the intersection of two datasets, and as a building block in countless shell pipelines. Building your own version is a lovely exercise in stream processing, careful state management, and the merge step that sits at the heart of merge sort, ideas you'll reach for again and again throughout your career.

The Challenge - Building Comm

In this challenge you're going to build your own version of comm, a streaming, sorted file comparison tool. Your tool will read two sorted files, compare them line by line, and print three columns of output: lines unique to the first file, lines unique to the second file, and lines common to both. It will be compatible with the standard POSIX comm utility, which means you'll be able to test your work directly against the system comm and use it as a drop-in replacement in shell pipelines.

The clever bit about comm is that it doesn't sort the files for you, it relies on the fact that they are already sorted to do its work in a single pass with constant memory. Two read pointers, one per file, and a small handful of comparison rules are all you need.

Step Zero

In this introductory step you're going to set your environment up ready to begin developing and testing your solution.

Choose your target platform and programming language. I'd encourage you to pick a language you're comfortable with for reading files line by line and parsing command-line arguments. Pretty much any general-purpose language is a good fit for this challenge. The focus is on the algorithm, not on the language.

Before you start coding, have a read through the POSIX comm specification and the man page on your own machine (man comm). Spend some time playing with the system comm so you get a feel for how it behaves, especially around the column indentation and the suppression flags.

Create a couple of small sorted test files to use throughout the challenge:

printf "apple\nbanana\ncherry\ndate\nelderberry\n" > file1.txt
printf "banana\ncherry\nfig\ngrape\n" > file2.txt
        

file1.txt and file2.txt are both sorted. The two files share banana and cherry, while apple, date, and elderberry are unique to the first and fig and grape are unique to the second. We'll use these throughout the challenge.

A quick note on locale. Both sort and the system comm only do byte-wise comparison when the locale is C or POSIX. With a default locale (especially on macOS), ordering becomes locale-aware -- Apple can sort between apple and banana rather than before them, which will confuse your test files and any comparisons you run against the system comm. Either set LC_ALL=C in your shell while working on this challenge, or prefix the relevant commands with it (LC_ALL=C sort ..., LC_ALL=C comm ...).

Step 1

In this step your goal is to read two sorted files and produce the three-column output that is the heart of comm.

Your tool should accept exactly two filenames as command-line arguments, open both files, and walk through them in sorted order using a single pass. For each line you should decide whether it belongs to column 1 (lines only in the first file), column 2 (lines only in the second file), or column 3 (lines in both). Column 1 has no leading tab, column 2 has one leading tab, and column 3 has two leading tabs. Lines should be compared byte by byte, the same way LC_ALL=C sort orders them.

You'll also need to handle files of different lengths -- when one file runs out, the remaining lines from the other file should be emitted in the appropriate column. Empty files are a useful edge case to think about: if one file is empty, every line of the other file goes into its own column. Don't worry about any flags or options yet; just get the basic three-column comparison working.

Testing: Run your tool against the test files and check your output:

ccomm file1.txt file2.txt
apple
		banana
		cherry
date
elderberry
	fig
	grape
        

Try it with files of different lengths, with one empty file, and with two completely disjoint files. Lines unique to the first file should appear with no leading tab, lines unique to the second file with one leading tab, and shared lines with two leading tabs.

Continued...

You can find the rest of this project on the Coding Challenges Substack.

Syamsul Rizal

Helping Leaders Build Meaning Before Visibility | Strategist | Writer | Founder of Syams Branding

1h

The power isn’t in the comparison, it’s in the fact that ordering is already solved before the tool runs. Once sorting is pushed upstream, the rest collapses into a linear walk.

Like
Reply

Waaaay back I built my own chat program using winsock... that eventually mutated into my own .... trolling tool. lol.In high school I built an encrypt/decrypt tool to apply a Caesar cipher to text files.I suffer from "I'll build my own"-itus a LOT. looool. I've built my own- task tracking tool because the stuff I found didn't work like *I* wanted- health snapshot tool because I couldn't find a unified platform to analyse blood tests, workouts, eating, journals and co-pilot starts "forgetting" stuff... so I figured having the checkpointed data I could feed to an LLM as context would be better.And that's just in the very recent past. haha

Like
Reply

I have a better idea. 1. build your own language 2. build an interpreter/compiler for it 3. rewrite that compiler in your new language and compile it with itself 4. then build everything on this list in your language ;-)

The problem with "Build your own " challenges is that there's usually no "decision-making guide" on how to decide on the trade-offs, the potential and features, and so on. But once you have this, these challenges are indeed great John Crickett 👏ï‚‚ Thanks for your great work!

The strongest developer growth comes from rebuilding core systems, where understanding replaces imitation and complexity reveals itself through hands-on implementation.

To view or add a comment, sign in

More articles by John Crickett

  • Coding Challenge #117 - AI Powered Support Bot

    This challenge is to build your own AI-powered customer support bot - and then discover, the hard way, why production…

    12 Comments
  • Coding Challenge #116 - Awk

    This challenge is to build your own version of awk, the classic text processing language. Awk was created in 1977 by…

    12 Comments
  • Coding Challenge #115 - Code Sherpa

    This challenge is to build your own semantic code exploration tool - a system that helps developers make sense of…

    20 Comments
  • Coding Challenge #114 - Gzip

    This challenge is to build your own version of gzip, the widely used file compression utility. Gzip has been a…

    9 Comments
  • Coding Challenge #113 - AI Writing Detector

    This challenge is to build your own AI writing detector that analyses text and determines the likelihood it was written…

    16 Comments
  • ## Coding Challenge #112 - AI Coding Agent

    This challenge is to build your own AI coding agent - a command-line tool that can read, understand, and modify code on…

    10 Comments
  • Coding Challenge #111 - AI Agent Scheduling System

    This challenge is to build your own AI agent scheduling system - a system that runs AI-powered tasks automatically on a…

    10 Comments
  • Coding Challenge #110 - RTFM For Me Agent

    This challenge is to build your own AI-powered documentation assistant - a tool that can ingest technical…

    12 Comments
  • Coding Challenge #109 - Ebook Reader

    This challenge is to build your own ebook reader application. EPUB is the most widely used open standard for digital…

    7 Comments
  • Coding Challenge #108 - Online Python Playground

    Coding Challenge #108 - Online Python Playground This challenge is to build your own online code playground where users…

    10 Comments

Explore content categories