Python Natural Language Processing Cookbook book announcement

I am happy to announce that my book, Python Natural Language Processing Cookbook, is now available in print and as a Kindle book.

In this blog post I will tell you more about the book so that you can better understand whether the book is for you.

The first important thing about the book is the Packt Publishing Cookbook format. This format is purely practical: each recipe shows how to solve a specific NLP problem, such as "How do I parse out parts of speech in text?" Each problem is solved with a code "recipe": line-by-line instructions for building a program that accomplishes the task, followed by a short explanation of how the program works.
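
To give a flavor of that kind of task, here is a minimal part-of-speech sketch using NLTK (this is an illustration, not a recipe from the book):

import nltk

# One-time downloads of the tokenizer and tagger data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)  # split the sentence into word tokens
print(nltk.pos_tag(tokens))        # pair each token with a part-of-speech tag, e.g. ('jumps', 'VBZ')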

There is an accompanying GitHub repository for the book, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook. The repository contains all the code that is referenced in the book.

The book covers a variety of topics (view the full table of contents):

  1. Chapter 1: Learning NLP basics
  2. Chapter 2: Playing with grammar
  3. Chapter 3: Representing text: capturing semantics
  4. Chapter 4: Classifying texts
  5. Chapter 5: Getting started with information extraction
  6. Chapter 6: Building chatbots
  7. Chapter 7: Topic modeling
  8. Chapter 8: Visualizing text data

Here is a preview of what a recipe looks like in the book. This one is from the recipe Training your own embeddings model in Chapter 3. The excerpt picks up at step 5; the earlier steps set up the imports and the helper functions used below.

How to do it

5. Get the books' sentences:

sentences = get_all_book_sentences(books_dir)

6. Tokenize and lowercase all the sentences:

sentences = [tokenize_nltk(s.lower()) for s in sentences]

7. Train the model. This step will take several minutes to run.

model = train_word2vec(sentences, word2vec_model_path)

8. We can now see which words the model considers most similar to a given input word, for example, river:

w1 = "river"
words = model.wv.most_similar(w1, topn=10)
print(words)

Every time a model is trained, the results will be different. My results look like this:

[('shore', 0.5025173425674438), ('woods', 0.46839720010757446), ('raft', 0.44671306014060974), ('illinois', 0.44637370109558105), ('hill', 0.4400100111961365), ('island', 0.43077412247657776), ('rock', 0.4293714761734009), ('stream', 0.42731013894081116), ('strand', 0.42297834157943726), ('road', 0.41813182830810547)]

How it works

In step 5 we get all the sentences from the books using the previously defined get_all_book_sentences function. In step 6 we lowercase the sentences and tokenize them into words. In step 7 we train the word2vec model using the train_word2vec function; this will take several minutes. In step 8 we print out the words the newly trained model considers most similar to the word river. Since the model is different every time you train it, your results will differ from mine, but the returned words should still be similar to river in the sense that they are about nature.
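
For context, here is a simplified sketch of what the helper functions used in this excerpt do; the actual definitions in the book and the GitHub repository differ in the details:

import os
import pickle
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

def get_all_book_sentences(books_dir):
    # Read every .txt file in the directory and split the text into sentences
    sentences = []
    for filename in os.listdir(books_dir):
        if filename.endswith('.txt'):
            with open(os.path.join(books_dir, filename), encoding='utf-8') as f:
                sentences.extend(sent_tokenize(f.read()))
    return sentences

def tokenize_nltk(text):
    # Split a sentence into word tokens
    return word_tokenize(text)

def train_word2vec(sentences, model_path):
    # Train a word2vec model on the tokenized sentences and pickle it to disk
    # (gensim 4.x parameter names; in gensim 3.x, vector_size is called size)
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
    pickle.dump(model, open(model_path, 'wb'))
    return model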

There's more

Although word2vec training is unsupervised, there are tools to evaluate the resulting model. gensim ships with a file of word analogies, such as "Athens is to Greece as Moscow is to Russia". The evaluate_word_analogies function runs the analogies through the model and calculates the proportion that were answered correctly.

Here is how to do this.

5. Load the previously pickled model.

model = pickle.load(open(word2vec_model_path, 'rb'))

6. Evaluate the model against the provided file. This file is available in the book's GitHub repository at Chapter03/questions-words.txt.

(analogy_score, word_list) = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

7. The score is the ratio of correctly answered analogies, so a score of 1 means that all analogies were answered correctly using the model, and a score of 0 means that none were. The word list is the detailed breakdown by individual analogy. We can print the analogy score:

print(analogy_score)

The result will be:

0.20059045432179762
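
The second return value, word_list, follows gensim's documented format: a list of dictionaries, one per analogy section, each holding that section's correct and incorrect analogies. This makes it easy to see which categories the model handles well, for example:

for section in word_list:
    correct, incorrect = len(section['correct']), len(section['incorrect'])
    if correct + incorrect > 0:
        # Print the per-section accuracy, e.g. for capital-common-countries
        print(section['section'], correct / (correct + incorrect))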

All the recipes have a structure similar to the one above.

If you read the book, I would love to know what you thought about it and what you found useful. Please leave a review on Amazon once you’re done reading. If you have any questions, feel free to reach out on LinkedIn or email at zhenya@practicallinguistics.com.
