Data Driven - Yet Again!
Good morning, LinkedIn!
(If you're hearing that in Robin Williams' voice from "Good Morning, Vietnam", we might be on the same wavelength!)
I took the plunge and started looking more deeply into the data I capture about my learning journey(s)...
Well, that's very meta. No, not that Meta!
About three months ago, I started down the rabbit hole of tracking which LinkedIn folks are influencing my thinking. Sadly, the URLs that I've captured in that time don't tell a consistent story. There are around 100 of you, but you may not know who you are. Yet!
I have no clue (so far) what separates the "activity-#" URLs from the rest of the "/posts/" URLs. Using an XLOOKUP against the data I capture, I can still get some decent information about the who and the what of it ... assuming I typed in something coherent in the first place.
For the most part, a "/posts/" URL contains the LinkedIn ID of the original poster as the first part of the page reference before an underscore character. Huzzah! That gets me to a name of some kind at least. Some are more obvious than others. Some have one word with no dashes. Some have multiple words with dashes. Some have multiple words and a trailing hexadecimal string with dashes. A few have "escaped" Unicode characters in them. Fun times!
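For illustration, here's a minimal Python sketch of that split; the URL and the ID in it are made up, but the shape matches what I described:

```python
from urllib.parse import urlparse

def poster_from_posts_url(url):
    """Pull the poster's LinkedIn ID out of a '/posts/' URL: everything in
    the last path segment before the first underscore."""
    path = urlparse(url).path                  # '/posts/jane-doe-1a2b3c4d_slug...'
    segment = path.rstrip("/").split("/")[-1]  # 'jane-doe-1a2b3c4d_slug...'
    return segment.split("_", 1)[0]            # 'jane-doe-1a2b3c4d'

# Hypothetical example:
print(poster_from_posts_url(
    "https://www.linkedin.com/posts/jane-doe-1a2b3c4d_great-read-activity-7123456789"))
# -> jane-doe-1a2b3c4d
```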
Of course, "/posts/" are not all there is, and the format of the other URLs gets tougher and tougher.
So, the "/advice/", "/events/", "/feed/", "/pulse/", "/video/", etc. all have different URL formats. Grumble! The first three types don't appear to contain any information about the poster / originator. "Pulse" URLs do contain information about the poster / originator, but it's at the end and no longer separated by an underscore from the rest of the URL. Grumble, grumble!
Since the poster / originator's LinkedIn ID could be (at least) one to three words separated by dashes and could include "escaped" Unicode characters, uniquely figuring out who wrote it from just the URL is difficult. If nothing else, it's a problem for another day.
For now, I'll fall back on my own data inputs and still more XLOOKUPs to collocate some meaningful data in that sheet.
If you're wondering where I get the URLs, pretty much every post, article, etc. has a "three dots" icon in the upper right if you're browsing LinkedIn on the web. Click that, and the drop-down menu contains some form of "Copy link to post".
Sadly, the "Copy link to post" comes with a bunch of tracking "stuff" in the URL. At some point, I'll automate filtering that out. For now, I just remove it as I paste it into the spreadsheet. Grumble!
YouTube
Lo and behold! There is an API quota for free accounts querying the details of YouTube videos by video ID. I'm not sure exactly what the limit is or how it's counted, but I hit it while testing these upgrades. Trying again the next morning, much less than 24 hours later, I was back below quota, so the quota apparently resets on a fixed clock rather than a rolling 24-hour window. That will take some more research.
On the flip side, that crashed my Python code outright with an unhandled exception, since I didn't know which exception to handle. In that case, it was a googleapiclient.errors.HttpError, and the stack trace / details showed that it was explicitly "reason: quotaExceeded". Huzzah!
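Here's roughly what the handler looks like now; the helper name and the placeholder API key are mine, but HttpError and the 403 / quotaExceeded combination are what the stack trace showed:

```python
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # placeholder key

def video_details(video_id):
    """Fetch one video's details, turning the quota crash into a soft failure."""
    try:
        return youtube.videos().list(part="snippet", id=video_id).execute()
    except HttpError as err:
        # The failure surfaced as an HTTP 403 with "reason: quotaExceeded".
        if err.resp.status == 403 and b"quotaExceeded" in err.content:
            print("Daily quota exhausted; retry after the quota resets.")
            return None
        raise  # Anything else is still unexpected; let it fail loudly.
```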
There's still something in the data flow that adds a trailing space to the URLs I copy/paste or move into the spreadsheet, but the number of those is way down. Still looking for the pattern there.
In Excel, VLOOKUP is good, but XLOOKUP can solve a few more problems! I'm up to 40+ YouTube links whose video IDs no longer return anything from the API (deleted or private videos, presumably). Fortunately, I capture more data in my to do list / status sheet that I can mine back out with an XLOOKUP.
Why an XLOOKUP? In this case, the value I'm matching against is in Column I (Comments), but what I want to retrieve is in Column H (Description). VLOOKUP can only return values from columns to the right of the one it matches on, so going the other way (H->I) would work just fine for a VLOOKUP, but going from I->H means a little extra work.
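Assuming the value being matched sits in A2 (the cell references are illustrative; the column roles match the description above), the formula looks something like:

```
=XLOOKUP(A2, I:I, H:H, "not found")
```

The fourth argument is XLOOKUP's built-in "if not found" fallback; VLOOKUP has no equivalent, and it can't look left at all.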
I still don't have a handle on how to deal with this particular situation in my Python code. The API query returns a nice, happy 200 like it (almost!) always does, but it returns no data. Now that I've seen an API failure, I know they're possible and come back as custom exception types.
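Continuing the sketch above (same youtube client and video_id), the fallback for now is just to check the payload, since there's no exception to catch:

```python
# A deleted or private video still comes back as HTTP 200, just with an
# empty "items" list, so the check has to be on the payload itself.
response = youtube.videos().list(part="snippet", id=video_id).execute()
items = response.get("items", [])
if not items:
    print(f"No data for video ID {video_id}; fall back to the spreadsheet.")
else:
    title = items[0]["snippet"]["title"]
```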
Formatting
There may be some pivot tables in the future here, but for the recent sites versus old sites lists, I've got "conditional formatting" adding a light green fill for new site names (versus my first stab at this data).
Similarly, I've got some light red fill on the old site list versus the new site list. There are only a few, and they're not general-purpose URLs (Teams, Zooms, etc.), so that doesn't worry me much.
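As a sketch, assuming the two lists live in named ranges (the names here are made up), a formula-based rule along these lines drives the green fill on the new list:

```
=COUNTIF(OldSites, A2) = 0
```

Swapping the ranges (=COUNTIF(NewSites, A2) = 0, applied to the old list) gives the light red rule.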
I definitely got the Python code wrong when I wrapped the "=VLOOKUP" and "=XLOOKUP" formulas in quotes. Oopsy! Those came through as normal text strings and had to be massaged a bit to work properly. We'll see if my workaround fixes that next month.
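For what it's worth, here's my guess at the fix, assuming the sheet gets built through Python's csv module (an assumption; the real pipeline may differ):

```python
import csv

formula = '=XLOOKUP(A2, I:I, H:H, "not found")'

with open("sites.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Pass the formula string as-is; csv.writer adds whatever quoting the
    # file format needs. Wrapping it in extra quote characters by hand,
    # e.g. f'"{formula}"', is what turns it into literal text in Excel.
    writer.writerow([formula])
```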
Conclusions
From April through June, I found work- (or industry-) related information on dozens more web sites than in my first data capture. I started tracking work / industry-related learning bits from LinkedIn, and I upgraded the Python to do more of the data processing work. There are plenty of opportunities left for API calls, more formatting, etc.
Making this some kind of rolling, monthly update will take more thought. Comparing against the first list of URLs is easy. Comparing against URLs, posters / creators, etc. is tougher.
Futures
At some point, I may get back in the groove of using the "openpyxl" Python library that I have loved in a few jobs. A mashup of old code from here and there would probably let me load the current Excel file, have the Python code extract the data from the "Comments" column, do all of the work to update the "sites" sheet, and then write back an updated Excel file with the changes in one shot. Future plans!
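A rough openpyxl sketch of that round trip might look like the following; the file name, sheet names, and column position are all placeholders rather than the real workbook layout:

```python
from openpyxl import load_workbook

wb = load_workbook("learning_journey.xlsx")   # placeholder file name
status = wb["Status"]                         # the to do list / status sheet
sites = wb["Sites"]                           # the sheet being rebuilt

# Pull every non-empty Comments value (assumed here to be column I).
comments = [row[0].value
            for row in status.iter_rows(min_row=2, min_col=9, max_col=9)
            if row[0].value]

# ... all of the URL parsing / site-list work would go here ...

for i, value in enumerate(comments, start=2):
    sites.cell(row=i, column=1, value=value)

wb.save("learning_journey_updated.xlsx")
```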