
Blog has moved! Please, update your links.

Backwards and forwards compatibility is an art. In its most basic and generic form, it consists of organizing the introduction of new concepts while allowing people to keep existing assets working. In some cases the new concepts are disruptive, in the sense that they prevent the original form of the asset from being preserved completely, and then careful consideration is needed to create a migration path which is technically viable and which, at the same time, helps people keep the process in mind. A great example of what not to do when introducing such disruptive changes has happened in Python recently.

Up to Python 2.5, any string you put within normal quotes (without a leading marker character in front of them) would be of type str, which was originally used for both binary and textual data, but which in modern times came to be seen as the type for binary data only. For textual information, the unicode type was introduced in Python 2.0, and it provides easy access to all the goodness of Unicode. Besides converting to and from str, it’s also possible to write Unicode literals in code by preceding the quotes with a leading u character.

This evolution happened quite cleanly, but it introduced one problem: both types were seen as the main way to input textual data at one point in time, and the language syntax makes it very easy to use either type interchangeably. Sounds good in theory, but the types are not interchangeable, and what is worse, in many cases the problem only shows up at runtime, when incompatible data passes through the code. This is what gives rise to the interminable UnicodeDecodeError problem you may have heard about. So what can be done about this? Enter Python 3.0.

In Python 3.0 an attempt is being made to sanitize this, by promoting the unicode type to a more prominent position, removing the original str type, and introducing a similar but incompatible bytes type which is more clearly oriented towards binary data.

So far so good. The motivation is good, the target goal is a good one too. As usual, the details may complicate things a bit. Before we go into what was actually done, let’s look at an ideal scenario for such an incompatible change.

As mentioned above, when introducing disruptive changes like this, we want a good migration path, and we want to help people keep the procedure in mind, so that they do the right thing even though they’re not spending too many brain cycles on it. Here is a suggested scheme for what might have happened to achieve that goal: in Python 2.6, introduce the bytes type, with exactly the same semantics as it will have in Python 3.0. During 2.6, encourage people to migrate str references in their code to either the existing unicode type, when dealing with textual data, or to the new bytes type, when handling binary data. When 3.0 comes along, simply kill the old str type, and we’re done. People can easily write code in 2.6 which supports 3.0, and if they see a reference to str they know something must be done. No big deal, and apparently quite straightforward.

Now, let’s see how to do it in a bad way.

Python 2.6 introduces the bytes type, but it’s not actually a new type. It’s simply an alias for the existing str type. This means that if you write code to support bytes in 2.6, you are not actually writing code which is compatible with Python 3.0. Why on earth someone would introduce an alias in 2.6 which produces code incompatible with 3.0 is beyond me. It must be some kind of anti-migration pattern. Then, Python 3.0 renames unicode to str, and kills the old str. So the result is quite bad: Python 3.0 has both str and bytes, and they both mean something different from what they meant in 2.6, which is the first version that was supposed to help migration. Not a single one of the three types from 2.6 had both its name and its semantics preserved in 3.0. In fact, only unicode survives at all, and under a different name.
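To make the end state concrete, here is a small sketch of how the two types behave under a Python 3 interpreter. The sample string is only an illustration. In 2.6, where bytes was just another name for str, indexing a bytes value gave back a one-character string and mixing it with other str values worked, so code written against that alias does not exercise any of the behavior below.

```python
# Python 3: str holds text, bytes holds binary data, and the two do not mix.
text = "café"                     # str: a sequence of Unicode code points
data = text.encode("utf-8")       # bytes: the UTF-8 encoding of that text

# The accented character takes two bytes in UTF-8, so the lengths differ.
lengths = (len(text), len(data))

# Indexing bytes yields an integer, not a one-character string.
first_byte = data[0]

# Mixing the two types fails right away, for every input.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding with the wrong codec is where UnicodeDecodeError comes from.
try:
    data.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

The point of the sketch is only that the failure for mixed types is immediate and unconditional, rather than depending on which data happens to flow through at runtime.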

There you go. I’ve heard people learn better from counter-examples. Here we have a good one to keep in mind and avoid repeating.

16 May 2009 @ 10:02 am

In my previous post I made an open statement which I’d like to clarify a bit further:

(…) when the rules don’t work for people, the rules should be changed, not the people.

This leaves a lot of room for personal interpretation of what was actually meant, and Tim Hoffman pointed that out nicely with the following question in a comment:

I wonder when the rule is important enough to change the people though. For instance [, if your] development process is oriented to TDD and people don’t write the tests or do the job poorly will you change them then?

This is indeed a nice scenario to explore the idea. If it happens at some point that a team claims to be using TDD, but in practice no developer actually writes tests first, the rules are clearly not working. If everyone in the team hates doing TDD, enforcing it most probably won’t bring its intended benefits, and that was the heart of my comment. You can’t simply keep the rule as is if no one follows it, unless you don’t really care about the outcome of the rule.

One interesting point, though, is that when you have a high level of influence over the environment in which people are, it may be possible to tweak the rules or the processes to adapt to reality, and tweaking the processes may change the way that people feel about the rules as a consequence (arguably, changing people as a side effect).

As a more concrete example, if I found myself in the described scenario, I’d try to understand why TDD is not working, and would try to discuss with the team to see how we should change the process so that it starts to work for us somehow. Maybe what would be needed is more discussion to show the value of TDD, and perhaps some pair programming with people that do TDD very well so that the joy of doing it becomes more visible.

In either case, I wouldn’t simply be telling people “Everyone has to do TDD from now on!”; I’d be tweaking the process so that it feels better and more natural to people. Then, if nothing of the kind works either, well, let’s change the rule. I’d try to use more conventional unit testing or some other system which people do follow more naturally and which presents similar benefits.


For a long time I’ve been an advocate of Python’s notion of controlling access to private and protected members (attributes, methods, etc) with conventions, by simply naming them like “_name”, with an initial underscore. Even though Python does support “__name” (with a double underscore) for “private” members (this actually mangles the name rather than hiding it), you’ll notice that even this is rarely used in practice, and the largely agreed mantra is that convention should be enough and thus one underscore suffices. This always resonated quite well with me, since I generally prefer to handle situations by agreement rather than enforcement. Well, I’m now changing my opinion that this works well for this purpose, at least in certain situations.
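As a quick illustration of the two naming conventions, here is a minimal sketch. The Account class and its attributes are made up for the example. It shows that a single underscore stops nobody, and that the double underscore only renames the attribute rather than hiding it.

```python
class Account:
    def __init__(self):
        self._balance = 10     # "protected" by convention only
        self.__pin = 1234      # stored under the mangled name _Account__pin


account = Account()

# Nothing stops outside code from reading the single-underscore attribute.
convention_value = account._balance

# The double-underscore name is not reachable under its original spelling...
try:
    account.__pin
    direct_access = True
except AttributeError:
    direct_access = False

# ...but anyone who reads the code can still get to it via the mangled name.
mangled_value = account._Account__pin
```

In both cases the "private" data is one attribute lookup away, which is exactly the shortcut discussed in the rest of this post.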

This methodology may work quite well in situations where the code scope is within a very controlled environment, with one or more teams which follow strictly a single development guideline, and have the power to refactor the affected code base somewhat easily when the original decisions are too limiting.

Having worked on a few major projects now, some of them libraries which are used by several teams inside or outside the same company, I now perceive that people very often take shortcuts around these decisions to get their job done quickly. It’s way easier to simply read the code and reach into the private guts of a library than to try to get agreement over the right way to do something, or to send a patch with a carefully architected change.

Many people by now are probably thinking: “Well, that’s their problem, isn’t it? If their code base breaks on the next upgrade they’ll get burned and won’t be able to upgrade cleanly.” I can honestly understand this feeling, since I shared it. But, for a number of reasons, I now understand that this isn’t just their problem, it’s very much my problem too.

Most importantly, on any serious software, these problems will usually come back to the implementors, and many times the problem will be much larger by then than it was at the time when a change could have been made “the right way” in the implementation, because code dependent on the private bits will have settled.

Most people are optimists by nature and believe that the implementation won’t change, but, of course, one of the reasons why private information is made private in the first place is exactly that the implementor believes that having the freedom to change these details in the future is important. Quite often there’s already a plan of evolution in place for these private pieces, which may include revamping the implementation entirely for scalability or for other goals.

In the best case, the careless people will get burned on the upgrade and will either ask for support or silently not upgrade at all. Both cases hurt implementors, because providing support for broken software takes time and energy, and amazingly can even hurt the image of the software. Lack of upgrades also means more ancient versions in the wild to support. Besides these, in the worst case scenario, the careless people have enough influence on the affected project to cause as much burden on it as if the private data had been public in the first place.

As much as I’m a believer in handling situations by agreement rather than enforcement, I’m also a believer that when the rules don’t work for people, the rules should be changed, not the people. So my position now is that language-supported access constraints (public, protected, private), as available in languages like Java and C++, are a better alternative than convention as used today in Python, since they provide an additional layer of encouragement for people not to break the rules carelessly, and that helps in the maintenance and reuse of software that has greater visibility.

12 August 2008 @ 04:46 am

The underlying concept is very simple: spreadsheets are a way to organize text, numbers and formulas into what might be seen as a natively numeric environment: a matrix. So what would happen if we loosened some of the bolts of the numeric-oriented organization, and tried to reuse the same concepts in a more formatting-oriented environment which is naturally collaborative: a wiki?

While I do encourage you to answer this with some fantastic new online service (please provide me with an account and the best e-book reader device available once you’re rich) I had a try at answering this question myself a while ago by writing the Calc macro for Moin.

Basically, the Calc macro allows extracting values found in a wiki page into lists (think columns or rows), and applying formulas and further formatting as wanted.

I believe there’s a lot of potential in the basic concept, and the prototype, even though functional and useful, surely has a lot of room to evolve, so I’ve published the project in Launchpad to make contributions easier. I actually apologize for not publishing it earlier. There was hope that more features would be implemented before releasing, but now it’s clear that it won’t get many improvements from me anytime soon. If you do decide to improve it, please try to prepare patches which are mostly ready for integration, including full testing, since I can’t dedicate much time to it myself in the foreseeable future.

 
 

As everyone is probably aware by now, in Python 3 dict.keys(), dict.values() and dict.items() will all return iterable views instead of lists. The standard way being suggested to overcome the difference, when the original behavior was actually intended, is to simply use list(dict.keys()). This is usually fine, but not in all cases.

One of the reasons why someone might actually opt to perform a more expensive copying operation is because, with the pre-3.0 semantics, the keys() method is atomic, in the sense that the whole operation of converting all dictionary keys to a list is done while the global interpreter lock is held. Thus, it’s thread-safe to run dict.keys() with Python 2.X.

The suggested replacement in Python 3, list(dict.keys()), is not. There’s a chance that the interpreter will give another thread a chance to run before or during the iteration of the view, and this will cause an exception if the dictionary is modified at the same time. To fix the problem, either a lock must protect the iteration, or a more expensive operation such as dict.copy().keys() must be used.
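The lock-based option can be sketched as follows. LockedDict and its method names are hypothetical, made up for this illustration, and the sketch assumes every writer goes through the same wrapper. Whether or not a particular interpreter happens to run list(d.keys()) without interruption, the lock makes the guarantee explicit instead of relying on an implementation detail.

```python
import threading


class LockedDict:
    """Hypothetical wrapper: writes and key snapshots share a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def snapshot_keys(self):
        # The view is iterated while the lock is held, so no writer using
        # set() can modify the dictionary in the middle of the copy.
        with self._lock:
            return list(self._data.keys())


shared = LockedDict()


def writer(start):
    for i in range(start, start + 1000):
        shared.set(i, i)


threads = [threading.Thread(target=writer, args=(n * 1000,)) for n in range(4)]
for thread in threads:
    thread.start()

# Take snapshots while the writers are still running.
snapshots = [shared.snapshot_keys() for _ in range(100)]

for thread in threads:
    thread.join()

final_keys = shared.snapshot_keys()
```

The cost is that every writer has to cooperate by taking the same lock, which is why copying the dictionary first may be the more practical choice in code you don't fully control.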

The 2to3 tool won’t help you there, unfortunately. So, keep an eye on it!

20 May 2008 @ 10:54 pm

According to Dave Troy, Google seems to be using the Geohash algorithm:

Google is employing the GeoHash algorithm I’ve been pushing to do spatial searching using BigTable. Since database schemes like BigTable don’t support traditional GIS extensions/spatial indexes, GeoHash allows for a simple bounding box search using truncated GeoHash substrings. I will post separately about this shortly, as I am working on some GeoHash tools to expand this functionality. This is of particular interest to AppEngine developers.

Nice!

 
 
03 March 2008 @ 12:49 am

On Friday I released version 1.4 of dateutil. There are some interesting fixes there, so please upgrade if you have the chance.

 
 
01 March 2008 @ 06:27 pm

Some improvements to geohash.org were made. Some of them were motivated by a conversation with Rodrigo Stulzer.

  • Support for geocoding addresses (city names, whatever). E.g. http://geohash.org/?q=21 Millbank, London
  • Support for moving the Geohash marker in the embedded map, so that modifying the position visually is easier.
  • Support for providing a “name” to Geohashes, by appending a colon and the name, in a nice format. E.g. http://geohash.org/c216ne:Mt_Hood
  • Provided a bookmark to get a Geohash while in Google Maps.
  • Provided a Google Maps Mapplet. When enabled, it adds a Geohash marker identifying the Geohash position in Google Maps, and it may be moved around.

Check out the Tips & Tricks page for details on these features.

 
 
26 February 2008 @ 09:11 pm

After about one year writing this service in my spare time, it’s finally out.

geohash.org offers short URLs which encode a latitude/longitude pair, so that referencing them in emails, forums, and websites is more convenient.

Geohashes offer properties like arbitrary precision, similar prefixes for nearby positions, and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision). I’ve put the algorithm created in the public domain. Some details may be seen in the Wikipedia article about it (hopefully that’ll help establish prior art, and prevent Microsoft from patenting it).
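For the curious, the encoding can be sketched in a few lines. This is a minimal version of the publicly documented scheme as I'd summarize it (bits alternate between longitude and latitude, each bit halving the remaining range, and every five bits become one base-32 character). It is not the code running at geohash.org, and the function name and sample coordinates are only illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, length=11):
    """Encode a latitude/longitude pair into a Geohash of `length` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if use_longitude:
            rng, value = lon_range, longitude
        else:
            rng, value = lat_range, latitude
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the range
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the range
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


full = geohash_encode(57.64911, 10.40744, 11)
short = geohash_encode(57.64911, 10.40744, 5)
nearby = geohash_encode(57.64912, 10.40745, 11)
```

Because each character only refines the cell chosen by the previous ones, the shorter code is simply a prefix of the longer one, and a point about a metre away still shares the leading characters here. Two nearby points that fall on opposite sides of a cell boundary may not share a prefix, so the prefix property is a tendency rather than a guarantee.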

To obtain the Geohash, the user provides latitude and longitude coordinates in a single input box (most commonly used formats for latitude and longitude pairs are accepted), and performs the request.

Besides showing the latitude and longitude corresponding to the given Geohash, users who navigate to a Geohash at geohash.org are also presented with an embedded map, and may download a GPX file, or transfer the waypoint directly to certain GPS receivers. Links are also provided to external sites that may provide further details around the specified location.

 
 

Mocker 0.10 is out, with a number of improvements!

While we’re talking about Mocker, here is another interesting use case, exploring a pretty unique feature it offers.

Suppose we want to test that a method hello() on an object will call self.show("Hello world!") at some point. Let’s say that the code we want to test is this:

class Greeting(object):

    def show(self, sentence):
        # Parenthesized form works the same for a single argument
        # under both Python 2 and Python 3.
        print(sentence)

    def hello(self):
        self.show("Hello world!")

This is the entire test method:

def test_hello(self):
    # Define expectation.
    mock = self.mocker.patch(Greeting)
    mock.show("Hello world!")
    self.mocker.replay()

    # Rock on!
    Greeting().hello()

This has helped me in practice a few times already, when testing some involved situations.

Note that you can also pass the call through. In other words, the call may actually be made on the real method, and mocker will just assert that the call was really made, whatever the effect is.

One more important point: mocker ensures that the real method exists in the real object, and has a specification compatible with the call made. If it doesn’t, an assertion error is raised in the test with a nice error message.

UPDATE: The method for doing this is actually mocker.patch() rather than mocker.mock(), as documented. Apologies.
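For readers without Mocker at hand, a rough analogue of the same test can be written with the standard library's unittest.mock. This is a different tool with a different style (calls are verified after the fact rather than recorded and replayed), so take it as a sketch of the idea rather than as Mocker's API. The autospec=True option plays a role similar to the specification check described above: a call that doesn't match the real method's signature is rejected.

```python
from unittest import mock


class Greeting:

    def show(self, sentence):
        print(sentence)

    def hello(self):
        self.show("Hello world!")


# Temporarily replace Greeting.show with a mock that keeps the real signature.
with mock.patch.object(Greeting, "show", autospec=True) as fake_show:
    greeting = Greeting()
    greeting.hello()

    # With autospec on a class attribute, the instance is recorded as the
    # first argument of the call.
    fake_show.assert_called_once_with(greeting, "Hello world!")
    call_count = fake_show.call_count

    # A call that does not fit the real signature is refused.
    try:
        greeting.show()  # missing the required "sentence" argument
        signature_checked = False
    except TypeError:
        signature_checked = True
```

Once the with block ends, the real show() method is back in place.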

 
 
22 November 2007 @ 08:27 pm

One neat feature which Mocker offers is the ability to very easily implement custom behavior on specific functions or methods.

Take for instance the case where you want to pretend to some code that a given file exists, but you don’t want to get in the way of everything else which needs the same function:

>>> from mocker import *
>>> mocker = Mocker()
>>> isfile = mocker.replace("os.path.isfile", count=False)
>>> _ = expect(isfile("/non/existent")).result(True)
>>> _ = expect(isfile(ANY)).passthrough()

>>> mocker.replay()

>>> import os
>>> os.path.isfile("/non/existent")
True
>>> os.path.isfile("/etc/passwd")
True
>>> os.path.isfile("/other")
False

>>> mocker.restore()

>>> os.path.isfile("/non/existent")
False

Notice that the count=False parameter is available in version 0.9.2. Without it Mocker will act in a more mocking-strict way and enforce that the given expressions should be executed precisely the given number of times (which defaults to one, and may be modified with the count() method).
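The same "special-case one argument, pass everything else through" idea can be approximated with the standard library's unittest.mock, for anyone who wants to try it without installing Mocker. This is a sketch using a different tool, not Mocker's own mechanism, and the helper name fake_isfile is made up for the example.

```python
import os.path
import tempfile
from unittest import mock

real_isfile = os.path.isfile  # keep a reference to the real function


def fake_isfile(path):
    # Pretend this one path exists; defer to the real function otherwise.
    if path == "/non/existent":
        return True
    return real_isfile(path)


with tempfile.NamedTemporaryFile() as real_file:
    with mock.patch("os.path.isfile", side_effect=fake_isfile):
        patched_special = os.path.isfile("/non/existent")
        patched_real = os.path.isfile(real_file.name)
        patched_missing = os.path.isfile("/non/existent/other")

# Outside the patch, the original behavior is back.
restored_special = os.path.isfile("/non/existent")
```

Unlike the Mocker version, nothing here counts or enforces how many times each expression runs. It only covers the behavior that count=False gives you.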

 
 
19 November 2007 @ 07:24 pm

A couple of additional releases tonight: dateutil 1.3, and nicefloat 1.1.

They’re both bug-fix releases.

 
 
17 November 2007 @ 06:01 pm

A few more improvements were made to Mocker.

 
 

I’ve recently seen some comments here and there about the lack of connection pooling as an argument for Storm to be faster, and that once this is supported it will be slower, or even as a reason for people not to use Storm at all.

So, let me kill this argument here, at once.

We have not developed Storm only for toy projects that take 10 connections a day. We have developed Storm for heavy duty web sites like Landscape and Launchpad, and we’re proud to see it being used not only in our systems, but also out there in the wild, like for instance in large scale sites developed by the fantastic guys at Lovely Systems.

So how does the connection reuse work in practice, you ask. Here is how:

In Storm, the database is abstracted behind a small, simple, and flexible API, offered in the Store class. You use an instance of this class to deal with objects coming from a given database, and this instance will handle several aspects of your interaction with the database, such as committing, rolling back, caching, ensuring that a given row in the database maps to a single instance in memory, control of dirty objects, flushing, and so on. Pretty much all of these aspects require correct transactional behavior to work well, and in practice this means we’ve decided that, to keep the API nice and clean, each Store is internally associated with a single Connection object. You can have as many stores as you want, connecting to the same database or to different ones, and using the same model class or entirely different code bases.

So, to summarize the above paragraph, a simple Store instance is your portal to the database. You need one of these instances around to add objects to the database (Storm won’t guess which Store you want to add things to), and to retrieve objects from it.

Considering that, if you want to reuse a connection, it’s very simple: keep your Store instance around. That’s even strange advice, since you’re already doing that if you’re using Storm in the first place. The code in trunk, which is about to be released as version 0.12, even handles reconnections for you gracefully, including correct transactional behavior.
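The shape of that argument can be shown with a toy stand-in. To be clear, this is not Storm code: TinyStore is a hypothetical class built on the standard library's sqlite3 module, and it exists only to illustrate the "one long-lived object owns one connection" arrangement described above.

```python
import sqlite3


class TinyStore:
    """Hypothetical stand-in for a store: it owns exactly one connection."""

    def __init__(self, path):
        self.connection = sqlite3.connect(path)

    def execute(self, sql, params=()):
        return self.connection.execute(sql, params)

    def commit(self):
        self.connection.commit()


# Create the store once and keep it around for all subsequent work.
store = TinyStore(":memory:")
store.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")

connection_ids = set()
for name in ("Ana", "Bruno", "Carla"):
    store.execute("INSERT INTO person (name) VALUES (?)", (name,))
    store.commit()
    connection_ids.add(id(store.connection))  # always the same object

names = [row[0] for row in store.execute("SELECT name FROM person ORDER BY id")]
```

Every unit of work goes through the same connection simply because the same store object is reused, with no pool involved.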

We even offer a tool that deals with more advanced Store management in a very comfortable way for Zope 3. In the future, we’re likely to offer the same kind of facility in a more generic API.

So, connection reuse is there, and we have always benefited from it. Connection pooling? No, thanks. We’re doing very well without the complexity and overhead.

 
 
11 November 2007 @ 11:17 pm

After being bothered for a long time by the lack of a better infrastructure for creating test doubles in Python, I decided to give it a go.

I’m actually quite happy with what came out. It took me about four weekends (it was developed as a personal project), and I’ll dare to say that it’s the best mocking system for Python at the present time. Not only that, but it has features that I’ve not seen in any other mocking/stubbing infrastructure, independent of language.

Here’s a feature list to catch your attention:

  • Graceful platform for test doubles in Python (mocks, stubs, fakes, and dummies).
  • Inspiration from real needs, and also from pmock, jmock, pymock, easymock, etc.
  • Expectation of expressions defined by actually using mock objects.
  • Expressions may be replayed in any order by default.
  • Trivial specification of ordering between expressions when wanted.
  • Nice parameter matching for defining expectations on method calls.
  • Good error messages when expectations are broken.
  • Mocking of many kinds of expressions (getting/setting/deleting attributes, calling, iteration, containment, etc)
  • Graceful handling of nested expressions (e.g. “person.details.get_phone().get_prefix()”)
  • Mock “proxies”, which allow passing through to the real object on specified expressions (e.g. useful with “os.path.isfile()”).
  • Mocking via temporary “patching” of existent classes and instances.
  • Trivial mocking of any external module (e.g. “time.time()”) via “proxy replacement”.
  • Mock objects may have method calls checked for conformance with real class/instance to prevent API divergence.
  • Type simulation for using mocks while still performing certain type-checking operations.
  • Nice (optional) integration with “unittest.TestCase”, including additional assertions (e.g. “assertIs”, “assertIn”, etc).
  • More …

Worked? Check it out!

 
 
17 October 2007 @ 02:16 pm

As Chris Armstrong pointed out yesterday, os.environ.pop() is broken in Python versions at least up to 2.5. The method will simply remove the entry from the in-memory dictionary which holds a copy of the environment:

>>> import os
>>> os.system("echo $ASD")

0
>>> os.environ["ASD"] = "asd"
>>> os.system("echo $ASD")
asd
0
>>> os.environ.pop("ASD")
'asd'
>>> os.system("echo $ASD")
asd
0

I can understand that the interface of dictionaries has evolved since os.environ was originally planned, and the os.environ.pop method was overlooked for a while. What surprises me a bit, though, is why it was originally designed the way it is. First, the interface will completely ignore new methods added to the dictionary interface, and they will apparently work. Then, why use a copy of the environment in the first place? This will mean that any changes to the real environment are not seen.

This sounds like something somewhat simple to do right. Here is a working hack using ctypes to show an example of the behavior I’d expect out of os.environ (Python 2.5 on Ubuntu Linux):

from ctypes import cdll, c_char_p, POINTER
from UserDict import DictMixin
import os

c_char_pp = POINTER(c_char_p)

class Environ(DictMixin):

    def __init__(self):
        self._process = cdll.LoadLibrary(None)
        self._getenv = self._process.getenv
        self._getenv.restype = c_char_p
        self._getenv.argtypes = [c_char_p]

    def keys(self):
        result = []
        environ = c_char_pp.in_dll(self._process, "environ")
        i = 0
        while environ[i]:
            result.append(environ[i].split("=", 1)[0])
            i += 1
        return result

    def __getitem__(self, key):
        value = self._getenv(key)
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        os.putenv(key, value)

    def __delitem__(self, key):
        os.unsetenv(key)

I may be missing some implementation detail which would explain the original design. If not, I suggest we just change the implementation to something equivalent (without ctypes).
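For comparison, here is a small check that can be run on a current interpreter. It assumes CPython 3, where os.environ is a mapping whose item deletion calls unsetenv, so pop() does reach the real environment there. The variable name ASD_DEMO and the helper child_sees are made up for the example. A child process is used as the observer, in the same spirit as the os.system() calls above.

```python
import os
import subprocess
import sys


def child_sees(name):
    """Return the value a freshly spawned child sees for `name`, or '<unset>'."""
    code = "import os, sys; sys.stdout.write(os.environ.get(%r, '<unset>'))" % name
    output = subprocess.check_output([sys.executable, "-c", code])
    return output.decode()


NAME = "ASD_DEMO"  # hypothetical variable name for this demonstration

os.environ[NAME] = "asd"
before = child_sees(NAME)

popped = os.environ.pop(NAME)  # removes the entry and unsets the variable
after = child_sees(NAME)
```

If the behavior described in this post were still present, the child would keep seeing "asd" after the pop() call.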

 
 
15 August 2007 @ 01:00 pm

Finally, a couple of projects I’ve been working on in the last year and a half have been made public, which means that I have more freedom to talk about them openly.

Landscape

Landscape is a system we’ve created to allow administrators to comfortably manage and observe a large number of computers remotely through a centralized web interface.

This description certainly won’t strike anyone as a brand new idea. There are indeed a large number of systems for remote management. Even then, Landscape does bring new ideas into that known field, such as a very flexible package management offering. Landscape, supporting only Ubuntu at the present moment, also has the advantage of being built inside the company which supports the operating system distribution itself.

There are currently 5 core developers, with many other people contributing in various areas. My role is Technical Lead, even though that says very little about the kind of relationship that we have within the project. The guys I work with are very smart and goal oriented, so decisions are made through friendly discussions and consensus, and initiative is seen coming from all directions.

Storm

Storm is an ORM we have developed for Python, to be used in Landscape, Launchpad, and other projects. The project was originally started because our attempts to perform client side partitioning (sharding) of data with existing ORMs for Python failed.

It was announced as an open source project in a talk I presented last month at EuroPython, and last week the second public release (0.10) was already made.

If you are around the Boston area in the US, my coworker and friend Christopher Armstrong will be giving a Storm talk at the Cambridge Python Meetup today. I’ll also be presenting it again at PyCon Brasil at the end of the month, in Joinville, Brazil.

 
 
26 June 2007 @ 08:02 pm

python-dateutil version 1.2 has just been released.

It includes the following changes:

  • Now tzfile will round timezones to full minutes if necessary, since Python’s datetime doesn’t support sub-minute offsets (reported by Ilpo Nyyssönen).
  • Removed bare string exceptions (reported and fixed by Wilfredo Sánchez Vega).
  • Fixed bug in leap count parsing (reported and fixed by Eugene Oden).
 
 
20 May 2007 @ 07:06 pm

Smart 0.51 has been released today. It includes a few bug fixes and some minor updates.

Shortly after the release, I added a couple of new hooks on Smart’s trunk as well: cache-loaded, and cache-loaded-pre-link. These should enable people to write plugins that hack the cache for specific purposes. Axel Thimm has been requesting these for a while to introduce kernel-related upgrades. Hopefully these will fulfill his needs.

This release took a while, probably because I’ve been quite immersed in our current project at Canonical, traveling very frequently, and without much time to blog or to do some of the usual open source activities I used to. The good thing is that we’re getting very close to the public announcement, and some of the work we’ve been doing will be released as open source, so we’re all likely to get more community-oriented interactions again.

 
 
10 March 2007 @ 01:12 am

brother…

My brother Diogo is in town! Good to see him after so much time.

pycon…

PyCon 2007 was fantastic. It was great to meet everyone there, and we had two awesome sprinting weeks around it.

confluence…

I’ve recently visited a confluence with a good friend of mine. Kayaks, paddling, walking, driving, swimming, asphalt, sand, water, grass... it was awesome.

svn2bzr…

It looks like Bazaar tags are now really coming, so I’m doing some work on svn2bzr again. Hopefully this time I’ll really migrate some projects over.

editmoin…

Version 1.9 of editmoin was released.

smart…

Some work in Smart is coming in the upcoming weeks.

projects…

Hopefully I’ll be able to speak more openly about (some of the) interesting things I’ve been working on in the near future.