100 Times Faster Natural Language Processing in Python
How to take advantage of spaCy & a bit of Cython for blazing fast NLP
By Thomas Wolf
When we published our Python coreference resolution package last year, we got amazing feedback from the community and people started to use it for many applications, some very different from our original dialog use case.
And we discovered that, while the speed was totally fine for dialog messages, it could be really slow on larger news articles.
I decided to investigate this in detail and the result is NeuralCoref v3.0, which is about 100 times faster than the previous version (several thousand words per second) while retaining the same accuracy, ease of use and ecosystem of a Python library.
In this post I wanted to share a few lessons learned on this project, and in particular:
- How you can design a high-speed module in Python,
- How you can take advantage of spaCy’s internal data structures to efficiently design super fast NLP functions.
So I am cheating a bit here because we will be talking about Python, but also about some Cython magic. But, you know what? Cython is a superset of Python, so don’t let that scare you away!
Your current Python program is already a Cython program.
There are several cases where you may need such speed-ups, e.g.:
- you are developing a production module for NLP using Python,
- you are computing analytics on a large NLP dataset using Python,
- you are pre-processing a large training set for a deep learning framework like PyTorch/TensorFlow, or you have heavy processing logic in your deep learning batch loader that slows down your training.
First step to rocket speed: Profiling
The first thing to know is that most of your code is probably just fine in pure Python, but there can be a few bottleneck functions that will make your code orders of magnitude faster if you give them some love.
You should thus start by profiling your Python code to find where the slow parts are located. One option is to use cProfile like this:
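Here is a sketch of what such a profiling run can look like (my_slow_module is a placeholder for your own code):

```python
import cProfile
import pstats
import my_slow_module  # placeholder: replace with the module you want to profile

# Run the entry point under the profiler and dump the statistics to a file
cProfile.run('my_slow_module.run()', 'restats')

# Load the statistics and print the 30 most expensive calls
p = pstats.Stats('restats')
p.sort_stats('cumulative').print_stats(30)
```

The cumulative timings will usually point to a handful of functions that spend their time looping over Python objects.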
So, how can we speed up these loops?
Fast Loops in Python with a bit of Cython
Let’s work this out on a simple example. Say we have a large set of rectangles that we store as a list of Python objects, e.g. instances of a Rectangle class. The main job of our module is to iterate over this list in order to count how many rectangles have an area larger than a specific threshold.
Our Python module is quite simple and looks like this:
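The original code snippet is not reproduced here, but a minimal version could look like this (the 0.25 threshold and the ten million rectangles are illustrative choices):

```python
from random import random

class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h

def check_rectangles(rectangles, threshold):
    # Count how many rectangles have an area above the threshold
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out

def main():
    n_rectangles = 10000000
    rectangles = [Rectangle(random(), random()) for _ in range(n_rectangles)]
    n_out = check_rectangles(rectangles, threshold=0.25)
    print(n_out)
```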
The check_rectangles function is our bottleneck! It loops over a large number of Python objects and this can be rather slow, as the Python interpreter does a lot of work under the hood at each iteration (looking for the area method in the class, packing and unpacking arguments, calling the Python API…). Here comes Cython to help us speed up our loop.
The Cython language is a superset of Python that contains two kinds of objects:
- Python objects are the objects we manipulate in regular Python like numbers, strings, lists, class instances…
- Cython C objects are C or C++ objects like double, int, float, struct, vectors, which Cython can compile into super fast low-level code.
A fast loop is simply a loop in a Cython program within which we only access Cython C objects.
A straightforward approach to designing such a loop is to define C structures that will contain all the things we need during our computation: in our case, the lengths and widths of our rectangles.
We can then store our list of rectangles in a C array of such structures that we will pass to our check_rectangles function. This function now has to accept a C array as input and thus will be defined as a Cython function by using the cdef keyword instead of def (note that cdef is also used to define Cython C objects).
Here is what the fast Cython version of our Python module looks like:
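The original snippet is missing here as well; a sketch of the Cython version, using the cymem package’s Pool to manage the C array’s memory (a choice borrowed from spaCy’s own code base), could read:

```cython
from cymem.cymem cimport Pool
from random import random

cdef struct Rectangle:
    float w
    float h

cdef int check_rectangles(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    # C arrays carry no size information, so the length is passed explicitly
    for rectangle in rectangles[:n_rectangles]:
        if rectangle.w * rectangle.h > threshold:
            n_out += 1
    return n_out

def main():
    cdef int i, n_out
    cdef int n_rectangles = 10000000
    cdef float threshold = 0.25
    # The Pool frees the allocated memory when it is garbage-collected
    cdef Pool mem = Pool()
    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    n_out = check_rectangles(rectangles, n_rectangles, threshold)
    print(n_out)
```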
First, install Cython with pip install cython.
First Tests in Jupyter
Load the Cython extension in a Jupyter notebook with %load_ext Cython.
Now you can write Cython code like Python code by using the magic command %%cython.
If you get a compilation error when you execute a Cython cell, be sure to check the Jupyter terminal output to see the full message.
Most of the time you’ll be missing a -+ flag after %%cython to compile to C++ (for example if you use the spaCy Cython API), or an import numpy if the compiler complains about NumPy.

Writing, Using and Distributing Cython Code
Cython code is written in .pyx files. These files are compiled to C or C++ files by the Cython compiler, and then compiled into machine-code libraries by the system’s C compiler. These libraries can then be imported by the Python interpreter.
You can load a .pyx file directly in Python by using pyximport:

```python
>>> import pyximport; pyximport.install()
>>> import my_cython_module
```
You can also build your Cython code as a Python package and import/distribute it as a regular Python package as detailed here. This can take some time to get working, in particular across all platforms. If you need a working example, spaCy’s install script is a rather comprehensive one.
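As a rough sketch (the file and package names are placeholders), a minimal setup.py for a Cython extension looks like:

```python
# setup.py -- minimal build script for a Cython extension
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="my_cython_module",
    ext_modules=cythonize("my_cython_module.pyx"),
)
```

You can then build it in place with python setup.py build_ext --inplace.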
Before we move on to some NLP, let’s quickly talk about the def, cdef and cpdef keywords, because they are the main things you need to grasp to start using Cython.
You can use three types of functions in a Cython program (all three are illustrated in the short sketch below):
- Python functions, defined with the usual keyword def. They take Python objects as both inputs and outputs. Internally they can use both Python and C/C++ objects and can call both Cython and Python functions.
- Cython functions, defined with the cdef keyword. They can take as input, use internally and output both Python and C/C++ objects. These functions are not accessible from the Python-space (i.e. the Python interpreter and other pure Python modules that would import your Cython module) but they can be imported by other Cython modules.
- Cython functions defined with the cpdef keyword are like the cdef Cython functions, but they are also provided with a Python wrapper, so they can be called from the Python-space (with Python objects as inputs and outputs) as well as from other Cython modules (with C/C++ or Python objects as inputs).
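Here is a quick, hypothetical sketch of the three flavors (not from the original post; the function names are made up):

```cython
# Python function: Python objects in and out, callable from Python-space.
def add_py(a, b):
    return a + b

# cdef function: a compiled C level call, invisible to the Python interpreter.
cdef int add_c(int a, int b):
    return a + b

# cpdef function: a C level call from Cython, plus a Python wrapper.
cpdef int add_cp(int a, int b):
    return a + b

def demo():
    # Inside Cython code we are free to mix all three.
    return add_py(1, 2), add_c(3, 4), add_cp(5, 6)
```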
The cdef keyword has another use, which is to type Cython C/C++ objects in the code. Unless you type your objects with this keyword, they will be considered as Python objects (and thus slow to access).

Using Cython with spaCy to speed up NLP
This is all nice and fast but… we are still not doing NLP here! No string manipulations, no unicode encodings, none of the subtleties we are lucky to have in Natural Language Processing.
And the official Cython documentation even advises against the use of C level strings:
Generally speaking: unless you know what you are doing, avoid using C strings where possible and use Python string objects instead.
So how can we design fast loops in Cython when we work with strings?
spaCy’s got us covered.
The way spaCy tackles this problem is quite smart.
Convert all strings to 64-bit hashes
All the unicode strings in spaCy (the text of a token, its lower case text, its lemma form, POS tag label, parse tree dependency label, Named-Entity tags…) are stored in a single data structure called the StringStore, where they are indexed by 64-bit hashes, i.e. C level uint64_t.
It is accessible from everywhere in spaCy and from every object, for example as nlp.vocab.strings, doc.vocab.strings or span.doc.vocab.strings.
When a module needs to perform fast processing on some tokens, it simply uses the C level 64-bit hashes instead of the strings. A call to the StringStore look-up table will then give back the Python unicode strings associated with the hashes.
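A quick illustration of this two-way mapping (assuming the small English model en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any installed model works here

run_hash = nlp.vocab.strings.add("run")  # unicode string -> 64-bit hash
run_text = nlp.vocab.strings[run_hash]   # 64-bit hash -> unicode string
print(run_hash, run_text)
```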
But spaCy does more than that and also gives us access to fully populated C level structures of the document and vocabulary, which we can use in Cython loops instead of having to build our own structures.
spaCy’s internal data structures
The main data structure associated with a spaCy document is the Doc object, which owns the sequence of tokens (“words”) of the processed string and all their annotations in a C level object called doc.c, an array of TokenC structures.
The TokenC structure contains all the information we need about each token. This information is stored as 64-bit hashes that can be re-associated to unicode strings as we’ve just seen.
To see exactly what’s in these nice C structures, just have a look at the freshly created Cython API doc of spaCy.
Let’s see that in action on a simple example of NLP processing.
Fast NLP Processing with spaCy and Cython
Let’s say we have a dataset of text documents we need to analyse, for example to count how many times the word “run” is used as a noun (i.e. tagged with the “NN” part-of-speech tag).
A Python loop to do that is short and straightforward:
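Something along these lines (a sketch, not the original snippet):

```python
def slow_loop(doc_list, word, tag):
    # Count the tokens whose lowercase text and POS tag match, across all docs
    n_out = 0
    for doc in doc_list:
        for tok in doc:
            if tok.lower_ == word and tok.tag_ == tag:
                n_out += 1
    return n_out

def main_nlp_slow(doc_list):
    n_out = slow_loop(doc_list, 'run', 'NN')
    print(n_out)
```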
First, we have to think about the data structure. We will need a C level array for the dataset, with pointers to each document’s TokenC array. We’ll also need to convert the test strings we use (“run” and “NN”) to 64-bit hashes.
When all the data required for our processing is in C level objects, we can then iterate at full C speed over the dataset.
Here is how this example can be written in Cython with spaCy:
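The original snippet is missing, so here is a sketch written against spaCy’s (v2-era) Cython API; the struct name DocElement and the function names are illustrative, and the cimports should be checked against your spaCy version. In a notebook, remember the %%cython -+ flag mentioned above, since the spaCy API needs C++:

```cython
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.typedefs cimport hash_t
from spacy.structs cimport TokenC

cdef struct DocElement:
    TokenC* c
    int length

cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):
    cdef int n_out = 0
    for doc in docs[:n_docs]:
        for c in doc.c[:doc.length]:
            # Compare C level 64-bit hashes instead of Python strings
            if c.lex.lower == word and c.tag == tag:
                n_out += 1
    return n_out

def main_nlp_fast(doc_list):
    cdef int i, n_out
    cdef int n_docs = len(doc_list)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    for i, doc in enumerate(doc_list):
        docs[i].c = doc.c          # the underlying TokenC array
        docs[i].length = doc.length
    # Convert the test strings to 64-bit hashes once, outside the fast loop
    word_hash = doc.vocab.strings.add('run')
    tag_hash = doc.vocab.strings.add('NN')
    n_out = fast_loop(docs, n_docs, word_hash, tag_hash)
    print(n_out)
```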