In this post we will be creating a Python script that tokenizes text. To do this we will use NLTK, short for the Natural Language Toolkit: a suite of libraries dedicated to natural language processing (NLP).
Tokenization is the process of breaking text down into smaller units called tokens. Tokens can be words, symbols, phrases or other elements within a given piece of text.
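To see why tokenization is more than splitting on spaces, here is a quick stdlib-only sketch (not part of the script below) comparing a naive whitespace split with a simple regular-expression tokenizer that separates punctuation, which is closer in spirit to what NLTK's word_tokenize produces:

```python
import re

text = "Hello, world!"

# A naive whitespace split keeps punctuation attached to the words
print(text.split())                      # ['Hello,', 'world!']

# A simple regex tokenizer splits words and punctuation into
# separate tokens
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', ',', 'world', '!']
```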
For the purpose of this example we will create a Python script that tokenizes the words in a piece of text provided via user input.
See the sample of code below for how this is achieved.
import nltk

# Download the tokenizer models required by word_tokenize
nltk.download('punkt')

# Repeatedly prompt for text and print its tokens
while True:
    text = input("Enter Text to Tokenize: ")
    tokens = nltk.word_tokenize(text)
    print(tokens)
An example of the script running can be seen below.
Enter Text to Tokenize: Hello and welcome to scriptopia.
['Hello', 'and', 'welcome', 'to', 'scriptopia', '.']
>>>
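The result returned by word_tokenize is an ordinary Python list, so it can be processed like any other list. As a small stdlib-only sketch (using the token list from the sample run above), here is how you might count tokens and token frequencies:

```python
from collections import Counter

# The token list produced by the sample run above
tokens = ['Hello', 'and', 'welcome', 'to', 'scriptopia', '.']

counts = Counter(tokens)
print(len(tokens))      # 6 tokens in total
print(counts['Hello'])  # 'Hello' appears once
```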
Take a look at some of our other content on the Python programming language by clicking here.