Coding Challenge #118 - Comm
This challenge is to build your own version of comm, the classic Unix utility that compares two sorted files line by line.
comm is one of those small tools in the Unix toolbox that solves a single problem really well. Given two sorted files, it tells you three things at once: which lines are unique to the first file, which lines are unique to the second file, and which lines appear in both. It does this in a single streaming pass, taking advantage of the fact that the inputs are already sorted, so it never has to load whole files into memory and never has to do an O(n²) comparison.
You'll find comm used to diff lists of users, find files that exist in one directory tree but not another, work out the intersection of two datasets, and as a building block in countless shell pipelines. Building your own version is a lovely exercise in stream processing, careful state management, and the merge step that sits at the heart of merge sort, ideas you'll reach for again and again throughout your career.
The Challenge - Building Comm
In this challenge you're going to build your own version of comm, a streaming, sorted file comparison tool. Your tool will read two sorted files, compare them line by line, and print three columns of output: lines unique to the first file, lines unique to the second file, and lines common to both. It will be compatible with the standard POSIX comm utility, which means you'll be able to test your work directly against the system comm and use it as a drop-in replacement in shell pipelines.
The clever bit about comm is that it doesn't sort the files for you, it relies on the fact that they are already sorted to do its work in a single pass with constant memory. Two read pointers, one per file, and a small handful of comparison rules are all you need.
Step Zero
In this introductory step you're going to set your environment up ready to begin developing and testing your solution.
Choose your target platform and programming language. I'd encourage you to pick a language you're comfortable with for reading files line by line and parsing command-line arguments. Pretty much any general-purpose language is a good fit for this challenge. The focus is on the algorithm, not on the language.
Before you start coding, have a read through the POSIX comm specification and the man page on your own machine (man comm). Spend some time playing with the system comm so you get a feel for how it behaves, especially around the column indentation and the suppression flags.
Create a couple of small sorted test files to use throughout the challenge:
printf "apple\nbanana\ncherry\ndate\nelderberry\n" > file1.txt
printf "banana\ncherry\nfig\ngrape\n" > file2.txt
file1.txt and file2.txt are both sorted. The two files share banana and cherry, while apple, date, and elderberry are unique to the first and fig and grape are unique to the second. We'll use these throughout the challenge.
A quick note on locale. Both sort and the system comm only do byte-wise comparison when the locale is C or POSIX. With a default locale (especially on macOS), ordering becomes locale-aware -- Apple can sort between apple and banana rather than before them, which will confuse your test files and any comparisons you run against the system comm. Either set LC_ALL=C in your shell while working on this challenge, or prefix the relevant commands with it (LC_ALL=C sort ..., LC_ALL=C comm ...).
Step 1
In this step your goal is to read two sorted files and produce the three-column output that is the heart of comm.
Your tool should accept exactly two filenames as command-line arguments, open both files, and walk through them in sorted order using a single pass. For each line you should decide whether it belongs to column 1 (lines only in the first file), column 2 (lines only in the second file), or column 3 (lines in both). Column 1 has no leading tab, column 2 has one leading tab, and column 3 has two leading tabs. Lines should be compared byte by byte, the same way LC_ALL=C sort orders them.
You'll also need to handle files of different lengths -- when one file runs out, the remaining lines from the other file should be emitted in the appropriate column. Empty files are a useful edge case to think about: if one file is empty, every line of the other file goes into its own column. Don't worry about any flags or options yet; just get the basic three-column comparison working.
Testing: Run your tool against the test files and check your output:
ccomm file1.txt file2.txt
apple
banana
cherry
date
elderberry
fig
grape
Try it with files of different lengths, with one empty file, and with two completely disjoint files. Lines unique to the first file should appear with no leading tab, lines unique to the second file with one leading tab, and shared lines with two leading tabs.
Continued...
You can find the rest of this project on the Coding Challenges Substack.
Helping Leaders Build Meaning Before Visibility | Strategist | Writer | Founder of Syams Branding
1hThe power isn’t in the comparison, it’s in the fact that ordering is already solved before the tool runs. Once sorting is pushed upstream, the rest collapses into a linear walk.
Waaaay back I built my own chat program using winsock... that eventually mutated into my own .... trolling tool. lol.In high school I built an encrypt/decrypt tool to apply a Caesar cipher to text files.I suffer from "I'll build my own"-itus a LOT. looool. I've built my own- task tracking tool because the stuff I found didn't work like *I* wanted- health snapshot tool because I couldn't find a unified platform to analyse blood tests, workouts, eating, journals and co-pilot starts "forgetting" stuff... so I figured having the checkpointed data I could feed to an LLM as context would be better.And that's just in the very recent past. haha
I have a better idea. 1. build your own language 2. build an interpreter/compiler for it 3. rewrite that compiler in your new language and compile it with itself 4. then build everything on this list in your language ;-)
The problem with "Build your own " challenges is that there's usually no "decision-making guide" on how to decide on the trade-offs, the potential and features, and so on. But once you have this, these challenges are indeed great John Crickett 👏ï‚‚ Thanks for your great work!
The strongest developer growth comes from rebuilding core systems, where understanding replaces imitation and complexity reveals itself through hands-on implementation.