How to Read Data From Text File Until Eop
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.
On Linux, you can install Python 3 with your package manager. For example, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can leave it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For instance:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
contents = myfile.read()          # read the entire file to string
myfile.close()                    # close the file
print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
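To see the automatic close for yourself, here is a minimal sketch. It assumes nothing about lorem.txt: it writes a small throwaway file (the name demo.txt is hypothetical) into a temporary directory, then checks the file object's closed attribute inside and after the with block.

```python
import os
import tempfile

# Work in a temporary directory so the sketch does not depend on lorem.txt.
# The file name "demo.txt" is a hypothetical stand-in.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")
    with open(path, "wt") as outfile:
        outfile.write("first line\nsecond line\n")

    with open(path, "rt") as myfile:
        contents = myfile.read()
        closed_inside = myfile.closed   # False: still open inside the block

    closed_after = myfile.closed        # True: the with block closed the file

print(closed_inside, closed_after)
```

The name myfile still exists after the block ends, but the file it refers to is closed, so any further read would raise a ValueError.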
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
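The "next result" behavior can be observed directly with the built-in next() function. The sketch below is only an illustration: it uses io.StringIO, which behaves like an open text file, so it runs without any file on disk. The two sample lines are made up.

```python
import io

# io.StringIO stands in for an open text file (an assumption for this sketch).
myfile = io.StringIO("line one\nline two\n")

first = next(myfile)         # each call returns the next line
second = next(myfile)
third = next(myfile, None)   # the default is returned once the lines run out

print(repr(first), repr(second), third)
```

A for loop does the same thing behind the scenes: it keeps asking for the next item until the iterator has nothing left to give.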
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a linebreak of its own at the end of whatever you've asked it to print.
Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
mylines = []                              # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:   # Open lorem.txt for reading text data.
    for myline in myfile:                 # For each line, stored as myline,
        mylines.append(myline)            # add its contents to mylines.
print(mylines)                            # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value (our list has four elements, at indices 0 through 3), we get an error:
Example
print(mylines[4])
Output:
Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
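If an out-of-range index is a real possibility in your program, one option is to catch the IndexError instead of letting it stop the script. The helper below, get_line(), is a hypothetical name introduced only for this sketch, and the three-element list is a stand-in for lines read from a file.

```python
mylines = ["first\n", "second\n", "third\n"]   # stand-in for lines read from a file

def get_line(lines, index):
    """Return lines[index], or None if that index does not exist."""
    try:
        return lines[index]
    except IndexError:
        return None

in_range = get_line(mylines, 0)                  # the first element
out_of_range = get_line(mylines, len(mylines))   # one past the last valid index

print(repr(in_range), out_of_range)
```

Comparing the index against len(mylines) before using it works just as well; which one reads better depends on the surrounding code.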
Example
A list object is iterable, so to print every element of the list, we can iterate over it with for...in:
mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open lorem.txt for reading text.
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
for element in mylines:                   # For each element in the list,
    print(element)                        # print it.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Example
Our revised program looks like this:
mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open file lorem.txt
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
for element in mylines:                   # For each element in the list,
    print(element, end='')                # print it without extra newlines.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This procedure is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
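One detail is easy to miss: chars is treated as a set of characters to remove, not as a suffix to match. The short sketch below, using made-up strings, shows the difference, along with what happens when rstrip() is called with no argument at all.

```python
# rstrip(chars) removes trailing characters that appear anywhere in chars,
# in any order and any number of times. It does not look for a suffix.
a = "123abc".rstrip("bc")            # only the trailing "b" and "c" go
b = "abccbb".rstrip("bc")            # every trailing "b" and "c" goes
c = "line of text\n".rstrip("\n")    # strips just the newline
d = "padded   \n".rstrip()           # no argument: strips all trailing whitespace

print(repr(a), repr(b), repr(c), repr(d))
```

Passing '\n' explicitly, as the examples in this guide do, leaves any trailing spaces or tabs in place, which matters if that whitespace is meaningful in your data.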
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip a newline character from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (indices 10 through 19; the stop index itself is not included). If stop is not specified, find() starts at index start, and stops at the end of the string.
Example
For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To start searching at index 10, and stop at index 30:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences between indices 25 and 30.
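The start and stop parameters follow the same rules as Python slices: the search includes the start index and ends just before the stop index. The sketch below demonstrates this on a short string taken from the first line of lorem.txt, so it runs without the file.

```python
s = "Lorem ipsum dolor sit amet"

i1 = s.find("e")          # first "e", in "Lorem"
i2 = s.find("e", 4)       # next "e" at or after index 4, in "amet"
i3 = s.find("e", 4, 24)   # stop index 24 is excluded, so that "e" is not seen
i4 = s.find("e", 4, 25)   # widening the range by one makes it visible again

print(i1, i2, i3, i4)
```

Because the stop index is excluded, a range of (4, 24) covers indices 4 through 23 only, which is why the third call fails while the fourth succeeds.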
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.
# Build array of lines from file, strip newlines

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                                # substring to search for.
for line in mylines:                        # string to be searched
    index = 0                               # current index: character being compared
    prev = 0                                # previous index: last character compared
    while index < len(line):                # While index has not exceeded string length,
        index = line.find(substr, index)    # set index to first occurrence of "e"
        if index == -1:                     # If nothing was found,
            break                           # exit the while loop.
        print(" " * (index - prev) + "e", end='')  # print spaces from previous
                                                   # match, then the substring.
        prev = index + len(substr)          # remember this position for next loop.
        index += len(substr)                # increment the index by the length of substr.
                                            # (Repeat until index > line length)
    print('\n' + line)                      # Print the original string under the e's
Output:
   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
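If you need the positions themselves rather than a printed picture, the same loop can collect the indices in a list. This is a minimal sketch, not the article's program: find_all() is a hypothetical helper name, and the sample string is the first line of lorem.txt.

```python
def find_all(text, substr):
    """Return the start index of every non-overlapping occurrence of substr."""
    positions = []
    if not substr:                 # an empty substring would never advance
        return positions
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Continue just past the match we found.
        index = text.find(substr, index + len(substr))
    return positions

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
e_positions = find_all(line, "e")
print(e_positions)   # [3, 24, 32, 35, 51]
```

Returning a list keeps the searching separate from the printing, so the same helper can feed a report, a count, or a highlighting routine.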
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
Character sequence | Meaning |
---|---|
\b | A word boundary matches an empty string (anything, including nothing at all), but only if it appears before or after a non-word character. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
To use this regular expression in Python search operations, we first compile it into a pattern object. For example, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
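To make the difference concrete, the short sketch below compares the same two characters written with and without the r prefix. Without it, Python turns \b into a single backspace character before the re module ever sees the string.

```python
plain = "\b"    # one character: the ASCII backspace
raw = r"\b"     # two characters: a backslash followed by the letter b

print(len(plain), len(raw))
print(ord(plain))          # code point of the backspace character
print(raw == "\\b")        # a doubled backslash spells the same raw string
```

The last line shows the alternative to a raw string: doubling every backslash. Raw strings are usually easier to read in regular expressions.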
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")       # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:     # Search for the pattern. If found,
    print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
import re
text = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
Output:
Found it.
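The match object returned by search() carries more than a yes-or-no answer. As a hedged illustration (the sample sentence is invented for this sketch), the code below reads the matched text and its position from the match object, then uses the pattern's findall() method to collect every match in the string.

```python
import re

text = "Good morning, doctor. The dour driver read the diary."
pat = re.compile(r"\bd\w*r\b")

match = pat.search(text)          # first match only, or None
if match is not None:
    first_word = match.group()    # the text that matched
    first_pos = match.start()     # index where the match begins
    print(first_word, first_pos)

words = pat.findall(text)         # all non-overlapping matches, as strings
print(words)
```

Note that "diary" and "read" are not in the result: the first ends in y rather than r, and the second does not start with d.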
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the print() statement, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in logfile.txt):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Output:
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
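The same result can be written a little more compactly. The sketch below is a variation, not a replacement for the program above: it swaps find() for the in operator, lets enumerate() do the line counting, and uses io.StringIO as a stand-in for logfile.txt so it runs without a file on disk.

```python
import io

# Stand-in for logfile.txt, holding the same five lines shown above.
log = io.StringIO(
    "This is line 1\n"
    "This is line 2\n"
    "Line 3 has an error!\n"
    "This is line 4\n"
    "Line 5 also has an error!\n"
)

substr = "error"
errors = []
for linenum, line in enumerate(log, start=1):   # enumerate() numbers the lines
    if substr in line.lower():                  # "in" tests for a substring
        errors.append(f"Line {linenum}: {line.rstrip(chr(10))}")

for err in errors:
    print(err)
```

Here chr(10) is simply the newline character, written that way because a backslash cannot appear inside an f-string expression in older Python 3 versions. Use find() when you need the position of the match, and in when you only need to know whether it is there.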
Extract all lines containing substring, using regex
The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.
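Indexing into each tuple works, but Python can also unpack a tuple into separate names right in the for statement, which often reads more clearly. A small sketch, using a hand-written list of tuples in place of the results from a real log file:

```python
# Hypothetical results, in the same (linenum, line) shape used above.
errors = [(3, "Line 3 has an error!"), (5, "Line 5 also has an error!")]

report = []
for linenum, text in errors:          # unpack each tuple into two names
    report.append("Line " + str(linenum) + ": " + text)

print("\n".join(report))
```

The names linenum and text replace err[0] and err[1], so the print logic no longer depends on remembering which position holds which value.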
Extract all lines containing a phone number
The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line ", str(err[0]), ": " + err[1])
Output:
Line  3 : My phone number is 731.215.8881.
Line  7 : You can reach Mr. Walters at (212) 558-3131.
Line  12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line  14 : She can also be contacted at (888) 312.8403, extension 12.
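Before pointing a pattern like this at real data, it helps to try it on a handful of known strings. The sketch below checks the regular expression against the five notations listed above, plus two made-up strings that should not match.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
    "no number here",      # hypothetical non-matching input
    "call 12345",          # too few digits to match
]

# Map each sample to True or False depending on whether search() finds a match.
results = {s: pattern.search(s) is not None for s in samples}
for sample, found in results.items():
    print(found, sample)

# What the pattern actually captures from the first sample.
captured = pattern.search("123-456-7890").group()
print(repr(captured))
```

The captured text shows that the pattern locks onto the last seven digits and their separators rather than the whole number. That is enough to flag lines containing a phone number, which is all the program above needs, but it would also flag any run of seven digits, so treat it as a loose filter rather than a validator.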
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')
Output:
Hope
heliotrope
hope
hornpipe
horoscope
hype
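Not every system has /usr/share/dict/words, so here is a small sketch that exercises the same pattern on a hand-made list. The sample words are invented for illustration, and each ends in a newline to mimic lines read from a file.

```python
import re

pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)

# Hypothetical sample lines standing in for the dictionary file.
lines = ["Hope\n", "hype\n", "helper\n", "shape\n", "horoscope\n", "hopeful\n"]

matches = [line.rstrip("\n") for line in lines if pattern.search(line)]
print(matches)
```

Two details are worth noticing. The $ anchor still matches even though each line ends in a newline, because $ is allowed to match just before a trailing newline. And "shape" is rejected because its h is not at a word boundary.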
Source: https://www.computerhope.com/issues/ch001721.htm