From the course: CompTIA Data+ (DA0-002) Cert Prep

Regular expressions

- [Instructor] When you're working with detailed, structured text, such as logs, monitoring output, or application events, you often need to isolate specific pieces of information. You might need to extract all email addresses, identify error messages, or find every IP address in a list of entries. Regular expressions, or regex, give you a way to do that with precision. Regex is a compact and flexible way to search for patterns in text. Instead of matching exact phrases, regex allows you to define the structure of what you're looking for. That might be a date, a status code, or a user identifier. Data analysts use regex all the time for data processing, scripting, validation, and log analysis. A regular expression is made up of characters that represent both literal text and special rules. You can combine these characters to describe the structure of the data that you want to match. Some characters match themselves. If you type a word, the expression will look for that exact word in the text. Other characters have special meaning. These are called meta characters. They let you define flexible patterns instead of fixed values. Meta characters are the part of regex that makes it powerful. These are symbols that don't match themselves, but instead control how the pattern behaves. They allow you to define general rules instead of looking for exact text. Let's review some of the most important meta characters. One important meta character is the . or period. It matches any single character no matter what that character is. You use this when you know the number of characters that you want, but you don't care what they are. Square brackets, [], let you define a set or range of characters. They match a single character from the list that you provide. You can include letters, numbers, or specific symbols. The + means one or more of the previous element. It tells the expression to keep matching as long as that element appears at least once. And the * means zero or more of the previous element.
It's the same thing as the +, but it also allows for the element to not appear at all. And you often need to search for boundaries between words. You can represent these with \b. These meta characters form the foundation of most regex patterns. Learning what they do and how they work together is the first step in writing expressions that match real-world data. So let's turn to some examples and see how these concepts come together in practice. We're going to use a tool called regex101.com. This is a free browser-based environment that lets you write regular expressions and test them against sample text that you provide. It shows you exactly what your pattern matches, how many times it matches, and where in the text those matches occur. It also provides you with an explanation of how your regex is performing. Now I'm going to begin by pasting my dataset into this test string window. This dataset is just a sample of log entries. Now I'm going to do the simplest search I can imagine. I'm just going to look for the word error in my dataset. I don't have to use any meta characters because I'm just looking for a simple find. I'm looking for the word error, so I'm going to type that into my regular expression field, and then I can see there are four matches, and it's highlighted the four times the word error appears in the dataset. One of them, I have to scroll down here to see, but all four of them are highlighted. I also get an explanation of my regular expression. It's telling me that error matches the characters error literally. So now let's make it a little more challenging. Let's say I want to find every word that starts with the letter A. Well, the first thing I might do is just type A into my regular expression. And now that gives me every occurrence of the letter A in my dataset. There are 75 of them. Some of them are in the middle of words, and some of them are at the beginning of words. They're all over the place.
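For anyone who wants to try the meta characters reviewed so far outside of regex101.com, here is a minimal sketch using Python's built-in re module. The sample strings are made up for illustration and are not the dataset from the video, and Python is only one of many tools that accept these same symbols.

```python
import re

# Hypothetical sample text for illustration; not the instructor's dataset.
sample = "2024-05-01 error disk full\n2024-05-01 info all ok\n2024-05-02 error timeout"

literal_hits = re.findall(r"error", sample)    # literal characters match themselves
dot_hits = re.findall(r"f.ll", sample)         # . stands for any single character
set_hits = re.findall(r"[0-9]", "a1b22")       # exactly one character from the set
plus_hits = re.findall(r"[0-9]+", "a1b22")     # one or more digits in a row
star_hits = re.findall(r"ab*", "a ab abb")     # "a", then zero or more "b"
plus_b_hits = re.findall(r"ab+", "a ab abb")   # "a", then at least one "b"

# A plain "a" matches anywhere; \b restricts it to the start of a word.
phrase = "an alarm at a farm"
any_a = re.findall(r"a", phrase)
leading_a = re.findall(r"\ba", phrase)

print(literal_hits)                   # ['error', 'error']
print(plus_hits, star_hits)           # ['1', '22'] ['a', 'ab', 'abb']
print(len(any_a), len(leading_a))     # 6 4
```

The last two lines mirror the situation in the walkthrough: searching for the bare letter finds it everywhere, while adding \b in front narrows the matches to letters that begin a word.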
Now, what I want is the letter A to appear at the beginning of the word. So that's a word boundary. What we want to match are cases where we have a word boundary followed by the letter A. And if you remember, my meta character for a word boundary is \b. So when I type \b and then A, I get every word that starts with the letter A, but notice that it's only highlighted the A. I want to highlight the whole word. There's also one other little nuance here. This dataset happens to be all lowercase, but a word could also start with the uppercase letter A. So what I'm going to do is put in square brackets a lowercase a and a capital A. So that expression matches a word boundary followed by either of those two characters, a lowercase a or an uppercase A. And I have my 13 matches of words that start with the letter A. Now, I want the entire word, so the next thing I need to do is say that this is going to be followed by some other letters. There are only letters in words. So I'm going to put another square bracket, and I'm going to say the lowercase letters a-z, and then the uppercase letters A-Z, just in case we have some capital letters there in the middle. Now what I have in my regular expression is still the same 13 matches, but I've highlighted the first two characters. I have the uppercase or lowercase a, followed by any other letter. Now, of course, words are more than two letters long in most cases, so what I also need to say is that there's going to be one or more of those letters. So I'm going to put a + to add that to my regular expression. And then I want the match to end at a word boundary, so I'm going to add another \b. And now what I have are my 13 matches, and I've highlighted the entire word every time a word beginning with the letter A appears in my dataset. Next, let's do something a little more complicated. Let's say that I want to find IP addresses. Now, I'm going to keep this simple and say that an IP address is any set of four numbers separated by periods.
That's enough to keep our example of the regular expression here simple. So what we're going to say is we have the characters 0-9. Then we're going to put the + to say one or more of those. And then we're going to have a . Now we have to use the \ and a . because the . is a meta character. Putting \. just says match the period. And you can see we've started to highlight on the screen cases where we have a number followed by a . Now we want four numbers, so I'm just going to repeat this. I'm going to say 0-9 again, a + and a . Now we have the beginnings of all these IP addresses highlighted. We'll do the next number, 0-9, a + and a . And then the last number, 0-9. Oops, 0-9 and a +. And now we've highlighted the IP addresses. Now, of course, this is going to match things that aren't IP addresses. I could have a five-digit number in there, for example, or a three-digit number that's greater than 255. I could go much deeper with my regular expression and filter those out, but that's beyond the scope of the course. We're just trying to give you a general idea of how regular expressions work here. Let's try a more complicated one. Let's look for email addresses. So we're going to look for something that stands by itself. So I'm going to put two word boundaries and say the email addresses should be surrounded by spaces or punctuation or be at the beginning or end of a string. And then the way an email address appears is we have a combination of letters or numbers followed by the @, then another combination of letters or numbers followed by a . And then we're going to look for email addresses that end with a three-letter, top-level domain. So something like .com, .net, .org. So we'll put three letters after that. So let's begin writing that regular expression. First, we're going to have that string of letters and numbers, so lowercase letters or uppercase letters, or the numbers 0-9. And we want to have one or more of those, so I'll put my + there.
And then I want the @, so I'm going to type in the @. So what you can see there is we've already sort of found the email addresses. We're in the right area. We just have to make sure we get the whole email address. We've got the prefixes here. So the next part of the email address is, again, going to be letters and numbers, a-z, A-Z, 0-9, and any number of them followed by a . And again, we have to use \. so we're using the literal character . instead of the meta character. And then we want to have three letters. So we're going to have this time just a-z, A-Z. There are no numbers in those. And since I want exactly three, the way I can specify that is putting in curly braces the number 3. And now I've found the seven email addresses that appear in my dataset. Now again, we've made some simplifications here, but you don't need to be an expert on regular expressions for the exam. We're just trying to get you familiar with this concept. I'd recommend that you take some time to practice with regular expressions yourself before you take the test. Visit regex101.com and try some of your own examples. You'll want to be familiar with the concept before you take the exam, and the only real way to do that is with hands-on practice.
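As a starting point for that practice, the three patterns built in the walkthrough can be written out and run with Python's built-in re module. This is a sketch reconstructed from the spoken description, so the exact expressions on screen may differ. In particular, the + after the domain name is an assumption (the narration only says "any number of them"), and the log lines are invented for illustration.

```python
import re

# Hypothetical log lines for illustration; not the dataset used in the video.
log = """\
2024-05-01 10:00:01 error login failed for alice@example.com from 192.168.1.10
2024-05-01 10:00:05 info account audit started by admin@corp.net
2024-05-01 10:01:12 error timeout contacting 10.0.0.7 after 30 seconds
"""

# Word boundary, a or A, one or more further letters, word boundary.
WORD_STARTING_WITH_A = r"\b[aA][a-zA-Z]+\b"

# Four runs of digits separated by literal periods (\. escapes the meta character).
SIMPLE_IP = r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"

# Letters/digits, @, letters/digits, a literal period, exactly three letters.
# The + after the domain part is an assumption, as noted above.
SIMPLE_EMAIL = r"\b[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{3}\b"

a_words = re.findall(WORD_STARTING_WITH_A, log)
ips = re.findall(SIMPLE_IP, log)
emails = re.findall(SIMPLE_EMAIL, log)

print(a_words)  # ['alice', 'account', 'audit', 'admin', 'after']
print(ips)      # ['192.168.1.10', '10.0.0.7']
print(emails)   # ['alice@example.com', 'admin@corp.net']

# The simplified IP pattern also accepts values that are not valid addresses.
accepts_invalid = re.fullmatch(SIMPLE_IP, "999.12345.1.1") is not None
print(accepts_invalid)  # True
```

Two things are worth noticing in the output. The word pattern picks up alice and admin from inside the email addresses, because the @ counts as a word boundary. And the last check confirms the simplification mentioned in the walkthrough: the IP pattern only checks the shape of the text, not whether each number is 255 or less.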
