Jake VanderPlas has a very thought-provoking essay about the future of visualization in Python. It's an exciting time for visualization in Python with so many new options exploding onto the scene, and Jake has provided a nice summary. However, I don't think it presents a very current view of matplotlib, which is still alive and well with funding sources, is moving to "modern" things like web frontends and web services, and has a nascent ongoing project related to hardware acceleration. Importantly, it has thousands of person-hours of investment in all of the large to tiny problems that have been found along the way.
In the browser
One of the directions that new plotting projects are taking is to be more integrated in the web browser. This has all of the advantages of cloud computing (zero install, distributed data), and integrates well with great projects like the IPython notebook.
matplotlib is already most of the way there. matplotlib's git master has completely interactive and full-featured plotting in the browser -- meaning it can do everything any of the other matplotlib backends can do -- by running something very similar to a VNC protocol between the server and the client. You can try it out today by building from git and using the WebAgg backend. Shortly, it will also be available as part of Google App Engine -- so we'll get some real experience running these things remotely in a real "Internet-enabled" mode. The integration work with IPython still needs to be done -- and I hope this can be a major focus of discussion at SciPy when I'm there.
The VNC-like approach was ultimately arrived at after many months of experimenting with approaches more based on JS and HTML5 and/or SVG. The main problem one runs into with those approaches is working with large datasets -- matplotlib has some very sophisticated designs to make working with large datasets zippy and interactive (specifically path simplification, blitting of markers, dynamic downsampling of images), all of which are just really hard to implement efficiently in a browser. D3's Javascript demos feel very zippy and efficient, until you realize how canned they are, or how much they rely on very specific means to shuttle reduced data to and from the browser. There's a place for interactive canned graphics as part of web publishing, but it's much less useful for doing science on data for the first time. In general, these experiments have made me rather skeptical of approaches that try to do too much in the browser. Even though matplotlib is on the "old" paradigm of running on a server (or a local desktop), the advantage of that approach is that we control the whole stack and can optimize the heck out of the places that need to be optimized. Browsers are much more of a black box in that regard.
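To give a feel for the kind of data reduction mentioned above, here is a minimal sketch of min/max decimation, where a long series is collapsed to one vertical span per pixel column. This is an illustration only and is not matplotlib's actual path simplification algorithm; the function name `decimate_minmax` and its parameters are made up for this example.

```python
def decimate_minmax(ys, n_columns):
    """Reduce a sequence of y-values to one (min, max) pair per pixel column.

    Hypothetical helper for illustration; it is not part of matplotlib.
    """
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    n = len(ys)
    reduced = []
    for col in range(n_columns):
        # Integer arithmetic splits the samples into near-equal buckets.
        start = col * n // n_columns
        stop = (col + 1) * n // n_columns
        if start == stop:
            continue  # more columns than samples: this bucket is empty
        bucket = ys[start:stop]
        reduced.append((min(bucket), max(bucket)))
    return reduced


# A million samples drawn on an 800-pixel-wide axes need only 800 spans.
samples = [(i * 37) % 1000 for i in range(1_000_000)]
spans = decimate_minmax(samples, 800)
```

The reduced data is tiny compared to the input, which is the property that makes it cheap to send to a client, but the reduction itself still has to run wherever the full dataset lives.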
I don't know if WebGL will offer some improvements here. It's enough of a moving target that it should probably be re-examined on a regular basis.
On the GPU
And in the diametrically opposite direction, we have projects moving plotting onto the GPU. Particularly interesting to me here is the glagg project by Nicolas Rougier and others. It's important to note for those not in the trenches that for high-quality 2D plotting on the GPU, things are much less straightforward than for 3D. Graphics cards don't "just do" high-quality 2D rendering out of the box. It requires the use of custom vertex shaders (which are frankly works of extreme brilliance and also an exercise somewhat in putting round pegs in square holes and living to tell about it). Unfortunately, these things require rather recent graphics hardware and drivers and a fair bit of patience to get up and running. Things will get easier over time, but at the moment, a 100% software implementation still can't be beat for portability and maximum accessibility for less technically-inclined users. But I look forward to where all of this is going.
Real benchmarking on real data needs to be performed to determine just how much faster these approaches will be for 2D plotting. (For 3D, which I discuss below, I think the advantages of hardware are more apparent).
Note
As a public service announcement, anyone looking for performance out of matplotlib should be using the Agg backend -- it's the only one with all optimizations available. The Mac OS X Quartz backend is built on a closed-source rendering library with some puzzling and surprising performance characteristics. Many of the attempts to speed up that backend involve workarounds for a black box that is not well understood. For the Agg-based backends, however, we control the stack from top to bottom and are able to optimize for real-world scientific plotting scenarios.
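For readers who have not selected a backend explicitly before, a minimal sketch of rendering through Agg looks like the following. It assumes matplotlib is installed; the data and the simplification threshold are arbitrary values chosen for illustration, not recommendations.

```python
import io

import matplotlib

matplotlib.use("Agg")  # select the backend before pyplot is imported

import matplotlib.pyplot as plt

# Path simplification is controlled by these rcParams.
matplotlib.rcParams["path.simplify"] = True
matplotlib.rcParams["path.simplify_threshold"] = 0.1

fig, ax = plt.subplots()
ax.plot([(i * 37) % 1000 for i in range(100_000)])

# Render to an in-memory PNG; Agg needs no display or GUI toolkit.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```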
In 3-dimensions
matplotlib's focus has always been on 2D. Despite this, Benjamin Root and others continue to put a lot of effort into matplotlib's 3D extensions, which fill a niche for many users who want clean and crisp vector 3D for print and which are improving all the time. There are, of course, fundamental architectural problems with 3D in matplotlib (most importantly the lack of a proper z-ordering) that limit what can be done. That should be fixable with a few well-placed C/C++ extensions -- I'm not certain that we need to go whole hog to the GPU to fix that, but that's certainly the obvious and well-trodden solution. I am concerned that too many of the new 3D projects seem to prioritize interactivity and hardware tricks at the expense of static quality. For this reason, matplotlib may continue to serve for some time as a high-quality print "backend" for some of these other 3D-based projects.
Interfaces
Another interesting direction of experimentation is in the area of user interface and API.
I think matplotlib owes a lot of its success to its shameless "embracing and extending" of the Matlab graphing API. Having taught matplotlib a few times to new users, I'm always impressed by how quickly new users pick it up.
The thing that's a bit cruftier and full of inconsistencies is matplotlib's Object-Oriented interface. Things there often follow the pattern that was easiest to implement rather than what would be cleanest from the outside. It's probably high time to start re-evaluating some of those APIs and breaking backward compatibility for the sake of more consistency going forward.
The Grammar of Graphics syntax from the R world is really interesting, and I think fills a middle ground. It's more powerful (and I think a little more complex to learn at first) than the Matlab interface, but it has the nice property of self-consistency that once you learn a few things you can easily guess at how to do many others.
Peter Wang's Bokeh project aims to bring Grammar of Graphics to Python, which I think is very cool. Note, however, that even there, Bokeh includes a Matlab-like interface, as does another plotting project, Galry, so mlab is by no means dead.
Doomed to repeat
There are a lot of ways in which matplotlib can be improved. I encourage everyone to look at our MEPs (matplotlib enhancement proposals) to see some ongoing projects that are being discussed or worked on. There are some large, heroic and important projects there to bring matplotlib forward.
But I think it is more interesting for this blog post to focus on some of the really low-level early architectural decisions that have limited matplotlib or made it difficult to move forward. Importantly, these aren't issues that I'm seeing discussed very often, but they are things I would try to tackle up front if I ever got a case of "2.0-itis" and were starting fresh today. Hopefully these injuries of experience can inform new projects -- or they may inspire someone with loads of time to take on refactoring matplotlib. It would not be impossible to make these changes to matplotlib, but it would take a concerted long-term effort and the ability to break some backward compatibility for the common good.
Generic tree manipulations
matplotlib plots are more-or-less tree structures of objects that are "run" to draw things on the screen. (It isn't strictly a tree, as some cross-referencing is required for things like referring to clip paths etc.) For example, the figure has any number of axes, each of which has lines plotted on it. Drawing involves starting at the figure and visiting each of its axes and each of its lines. All very straightforward. But there is no way to traverse that tree in a generic way to perform manipulations on it.
For example, you might want to have a plot with a number of different-colored lines that you want to make ready for black-and-white publication by changing the lines to have different dash patterns. Or, you might want to go through all of the text and change the font. Or, as much as it personally wouldn't fit my workflow, many people would like a graphical editor that would allow one to traverse the tree of objects in the plot and change properties on them. There's currently no way to do this in a generic way that would work on any plot with any kind of object in it.
I'm thinking that what is needed is something like the much-maligned "Document Object Model (DOM)" (if I say "ElementTree" instead, is that more appealing to Pythonistas?). That way, one could traverse this tree in a generic fashion and do all kinds of things without needing to be aware of what specifically is in the plot.
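As a rough sketch of what such generic traversal could look like, consider the toy tree below. The `Node` class, its `walk` method, and the property names are all hypothetical and exist only for this illustration; none of this is matplotlib API.

```python
class Node:
    """A hypothetical plot element: a kind, a property dict, and children."""

    def __init__(self, kind, children=(), **props):
        self.kind = kind
        self.props = dict(props)
        self.children = list(children)

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


figure = Node("figure", [
    Node("axes", [
        Node("line", color="red", dashes="solid"),
        Node("line", color="blue", dashes="solid"),
        Node("text", font="serif"),
    ]),
    Node("axes", [
        Node("line", color="green", dashes="solid"),
    ]),
])

# Prepare for black-and-white print: every line becomes black with its own
# dash pattern, and the code never needs to know how the plot is laid out.
dash_patterns = ["solid", "dashed", "dotted", "dashdot"]
lines = [node for node in figure.walk() if node.kind == "line"]
for index, line in enumerate(lines):
    line.props["color"] = "black"
    line.props["dashes"] = dash_patterns[index % len(dash_patterns)]
```

The same `walk` call would serve a font change or a graphical property editor equally well, which is the point of having one generic traversal.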
Type-checking, styles, properties, traits
matplotlib predates properties and traits and other things of that ilk, so it, not unreasonably, uses get_ and set_ methods for most things. Beyond the syntactic implications of this (which don't bother me as much as some), these methods behave rather inconsistently. Some are available as keyword arguments to constructors and plotting methods, but not uniformly, because some must be manually handled by the code while others are handled automatically. Some type-check their arguments immediately, whereas others will blow up on invalid input much later in some deeply nested backtrace. Some are mutable and cause an update of the plot when changed. Some seem mutable, but changing them has no effect. Traits (such as Enthought Traits or something else in that space) would be a great fit for this. It's been examined a few times, and while the idea seems to be a good fit, the implementation was always the stumbling block. But it's high time to look at this again.
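To make the wish list concrete, here is a minimal sketch of a property that type-checks immediately and reports changes to its owner, written with a plain Python descriptor. It is not the Enthought Traits API and not anything matplotlib provides; the names `Validated`, `Line`, `on_change`, and `stale` are assumptions made up for this example.

```python
class Validated:
    """Descriptor that type-checks on assignment and reports real changes.

    Hypothetical illustration; this is not the Enthought Traits API.
    """

    def __init__(self, expected_type, default):
        self.expected_type = expected_type
        self.default = default

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name, self.default)

    def __set__(self, instance, value):
        # Fail at the point of assignment, not deep inside a later draw.
        if not isinstance(value, self.expected_type):
            raise TypeError(
                f"{self.name} expects {self.expected_type}, "
                f"got {type(value).__name__}"
            )
        old = self.__get__(instance)
        instance.__dict__[self.name] = value
        if value != old:
            instance.on_change(self.name, old, value)


class Line:
    linewidth = Validated((int, float), 1.0)
    color = Validated(str, "black")

    def __init__(self):
        self.stale = False

    def on_change(self, name, old, new):
        # A real implementation would schedule a redraw here.
        self.stale = True


line = Line()
line.linewidth = 2.5  # accepted, and marks the line as needing a redraw
try:
    line.color = 42  # rejected immediately
except TypeError as error:
    message = str(error)
```

With every property declared this way, validation, mutability, and redraw behavior would be uniform by construction rather than by convention.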
Combining this with the tree manipulation suggestion above, we'd be able to do really powerful things like CSS-style styling of plots.
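A very small sketch of that idea, assuming a plot represented as nested dictionaries and a stylesheet of (selector, declarations) pairs, might look like this. The data layout, the `apply_stylesheet` function, and the selector syntax (an element kind or `*`) are all hypothetical.

```python
def walk(element):
    """Yield an element and all of its descendants, depth first."""
    yield element
    for child in element.get("children", []):
        yield from walk(child)


def apply_stylesheet(root, stylesheet):
    """Apply rules in order, so later rules override earlier ones."""
    for element in walk(root):
        for selector, declarations in stylesheet:
            if selector == "*" or selector == element["kind"]:
                element.setdefault("style", {}).update(declarations)


plot = {"kind": "figure", "children": [
    {"kind": "axes", "children": [
        {"kind": "line"},
        {"kind": "text"},
    ]},
]}

stylesheet = [
    ("*", {"color": "black"}),
    ("text", {"font": "Helvetica", "size": 10}),
    ("line", {"linewidth": 1.5}),
]
apply_stylesheet(plot, stylesheet)
```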
(Didn't I just say above that web browsers weren't well suited, yet I'm stealing some fundamentals of their design here...?)
Related to the above, matplotlib's handling of colors and alpha-blending is all over the map. There needs to be a cleanup effort to make handling consistent throughout. Once that's done, the way forward should be clear to manage CMYK colors internally for formats that support them (e.g. PDF). Ditto on other properties like line styles and marker styles.
Projections and ticking
Ticking is the process by which the positions of the grid lines, ticks and labels are determined. There are a number of third-party projects that build new projections on top of matplotlib (basemap, pywcsgrid2, cartopy to name a few). Unfortunately, they can't really take advantage of the many (and subtly difficult) ticking algorithms in matplotlib because matplotlib's tickers are so firmly grounded in the rectilinear world. matplotlib needs to improve its tickers to be more generic and more usable when the grid is rotated or warped in a myriad of ways so that all of this duplication of effort can be reduced.
Related to this, matplotlib has transformations (which determine how the data is mapped to the Cartesian space on screen), tickers (which determine the positions of grid lines), formatters (which determine how the tick's text label is rendered) and locators (which choose pleasant-looking bounds for the data), all of which work in tandem to produce the labels, ticks and gridlines, but which have no relationship to each other. It should be easier to relate these things together, because you usually want a set that works well together. Phil Elson has done some work in this direction, but there's still more that could be done.
Higher dimensionality
matplotlib's 3D support feels tacked on structurally. It would be better if more parts were agnostic to the dimensionality of the data.
May you live in interesting times
It's really exciting to watch all that's going on, and thanks to Jake VanderPlas for getting this discussion rolling.