Need help with data files and setup.py

I’m working on a package that includes some files that are meant to be copied and edited by people using the package.

My project is named “pitz” and it is a bugtracker. Instead of using a config file to set the options for a project, I want to use python files.

When somebody installs pitz, I want to save some .py files somewhere so that when they run my pitz-setup script, I can go find those .py files and copy them into their working directory.

I have two questions:

  1. Do I need to write my setup.py file to specify that the .py files in a particular directory need to be treated like data, not code? For example, I don’t want the installer to hide those files inside an egg.
  2. How can I find those .py files later and copy them?

Here’s my setup.py so far:

from setuptools import setup, find_packages

version = '0.1'

setup(name='pitz',
    version=version,
    description="Python to-do tracker inspired by ditz (ditz.rubyforge.org)",

    long_description="""\
ditz (http://ditz.rubyforge.org) is the best distributed ticketing
system that I know of. There's a few things I want to change, so I
started pitz.""",

    classifiers=[],
    keywords='ditz',
    author='Matt Wilson',
    author_email='[email protected]',
    url='http://tplus1.com',
    license='',
    packages=find_packages(exclude=['ez_setup', 'examples', 'tests']),

    include_package_data=True,
    package_dir={'pitz': 'pitz'},

    data_files=[('share/pitz',
        [
            'pitz/pitztypes/agilepitz.py.sample',
            'pitz/pitztypes/tracpitz.py.sample',
        ])],

    zip_safe=False,
    install_requires=[
        # 'PyYAML',
        # 'sphinx',
        # 'nose',
        # 'jinja2',
        # -*- Extra requirements: -*-
    ],

    # I know about the much fancier entry points, but I prefer this
    # solution. Why does everything have to be zany?
    scripts=['scripts/pitz-shell'],

    test_suite='nose.collector',
)

When I run python setup.py install, I do get those .sample files copied, but they get copied into a folder way inside of my pitz install:

$ cd ~/virtualenvs/scratch/lib/
$ find -type f -name '*.sample'
./python2.6/site-packages/pitz-0.1dev-py2.6.egg/share/pitz/tracpitz.py.sample
./python2.6/site-packages/pitz-0.1dev-py2.6.egg/share/pitz/agilepitz.py.sample

I don’t know how I can write a script to copy those tracpitz.py.sample files out. Maybe I can ask pitz what its version is, and then build a string and use os.path.join, but that doesn’t look like any fun at all.
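
The best I’ve come up with so far is something like this sketch. It leans on the egg being installed unzipped (which zip_safe=False should guarantee), and the paths are guesses based on the find output above, so treat it as an illustration rather than a real answer:

# Rough sketch: walk up from the installed pitz package to find share/pitz,
# then copy the .sample files into the current directory.
import os
import shutil

import pitz

# .../pitz-0.1dev-py2.6.egg/pitz/__init__.py -> .../pitz-0.1dev-py2.6.egg
egg_dir = os.path.dirname(os.path.dirname(os.path.abspath(pitz.__file__)))
sample_dir = os.path.join(egg_dir, 'share', 'pitz')

for name in ('agilepitz.py.sample', 'tracpitz.py.sample'):
    shutil.copy(os.path.join(sample_dir, name), os.getcwd())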

So, what should I do instead?

Define your validation schema inline

The TurboGears docs show how to assign validators for individual parameters in the validate decorator like this:

@validate(validators={'a': validators.Int(), 'b': validators.DateConverter()})
@error_handler()
def f(self, a, b, tg_errors=None):
    # Now a is already an integer and b is already a datetime.date object,
    # unless there were some validation errors.

That’s great, but there are some validations that depend on numerous parameters at the same time. For example, you might want to make sure that an employee’s hire date precedes the termination date.

I already knew how to subclass validators.Schema to do this, and then pass that instance into the validate decorator like this:

class MattSchema(validators.Schema):
    a = validators.Int()
    b = validators.DateConverter()
    chained_validators = [blah]  # pretend that blah does some compound validation.

@validate(validators=MattSchema())
def f(self, a, b):

This approach is fine, but today I discovered that it is also possible to define a Schema inline, inside the validate decorator, and specify the chained_validators right there, like this:

@expose('.templates.shiftreports.overtime')
@validate(validators=validators.Schema(
        a=validators.Int(),
        dt=validators.DateConverter(),
        chained_validators=[blah]),
    state_factory=matt_state_factory)
def f(self, a, b):

What’s the point? Well, it seems wasteful to define a class and hide it in another file if that schema is only going to be used for exactly one controller. Also, this makes it really fast for me to mix and match compound validators with controllers. I don’t need to pop open my separate validators file where all my elaborate schemas live. I can define them right here.

I’m very forgetful too, so I like to keep my code shallow so that I can instantly see what the heck something does. With all the validators right there, I can easily figure out what the system intends to do.

However, I would define a Schema subclass as soon as I see that I need the same thing twice.

I’m happy that the FormEncode authors had the foresight to support this inline approach along with the declarative style.

Using state with FormEncode and TG’s validate decorator

I believe I figured out a way to reduce a few redundant lines from my controller methods. I’m looking for opinions about whether this is a wise idea.

At the top of nearly every method in my controllers, I look up the current user and the hospital this user belongs to, sort of like this:

@expose('.templates.m1')
def m1(self):
    u = identity.current.user
    hospital = u.hospital

Anyhow, I realized I can offload this irritating work to a validator that uses a state factory. Now my method looks like this:

@expose('.templates.m1')
@validate(validators=LookupSchema(), state_factory=my_state_factory)
def m1(self, u=None, hospital=None):

So now all my methods get called with those values already set up. I have to make u and hospital keyword parameters, because otherwise TG will try to pull their values out of the URL.

Here’s how it works. First I make my_state_factory that builds an object that has those values:

def my_state_factory():

    class StateBlob(object):
        pass

    sb = StateBlob()
    sb.u = identity.current.user
    sb.hospital = sb.u.hospital

    return sb

Now the LookupSchema extracts those values out of the state blob object and adds them to the dictionary of values:

from formencode.schema import Schema, SimpleFormValidator

@SimpleFormValidator
def f(value_dict, state, validator):
    value_dict['u'] = state.u
    value_dict['hospital'] = state.hospital

class LookupSchema(Schema):
    allow_extra_fields = True  # otherwise, it fusses about self ?!?!?
    chained_validators = [f]

So the benefit of all this is that some repetitive code is now just defined in a single place. Also, I’m getting more comfortable with the internals of FormEncode and the TG validate decorator.

Pretty soon, my controllers will be some really skinny methods. All the calculations of new variables based on the original parameters will happen outside the controller. The controller will just handle picking the right template.

Sometimes I think validate + formencode is more hassle than it is worth

I’m hoping somebody will read this and show me a better way.

In general, I like formencode. I like that I can do stuff like:

@validate(validators=SomeGnarlySchema())
def m(self, a, b, c, d, e=None):

And then I know that all my parameters have been converted from their original string values into whatever I want.

But I also find that I spend a lot of time getting my complex schemas to work. Like right now, I have an optional parameter e. e should either be a string representing a date, or it can be None.

I’ve got a validator with this logic in it for e:

  1. First try to return a datetime.date object from parsing e.
  2. Otherwise, look in the cookie for a key “e” and try to return that after parsing it into a datetime.date.
  3. Finally, just return today’s date.

So, the idea is that some visitor can come to page /m and always see data for today. Or, they can use a calendar widget to choose a value. On subsequent visits back to /m, I’ll keep showing them that same date they chose, because I saved it in a cookie.

Here’s the problem. I have to make e an optional parameter because I don’t want to require that people hit the site with a url that contains a value for e.

However, when e is None, then my validator for e is ignored! So, as far as I know, at this point, I need to use a validator that operates on the whole set of parameters.

Which is also possible, but in my brain, it seems wrong that I have to use a schema-level validator when I really am only validating one single field.
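
Just to show what I mean, here is roughly what that schema-level workaround looks like. The read_e_cookie helper and the date format are made up for the sake of the example:

import datetime

from formencode.schema import Schema, SimpleFormValidator


def parse_date(s):
    # Turn a string like '2008-03-21' into a datetime.date, or return None.
    try:
        return datetime.datetime.strptime(s, '%Y-%m-%d').date()
    except (TypeError, ValueError):
        return None


@SimpleFormValidator
def pick_e(value_dict, state, validator):
    # 1. Try the submitted value.
    d = parse_date(value_dict.get('e'))

    # 2. Otherwise, fall back to whatever got saved in the cookie.
    if d is None:
        d = parse_date(read_e_cookie())  # read_e_cookie is a made-up helper.

    # 3. Finally, just use today's date.
    value_dict['e'] = d or datetime.date.today()


class EDateSchema(Schema):
    allow_extra_fields = True
    chained_validators = [pick_e]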

More generally, anybody that subscribes to the formencode mailing list sees first-hand just how confusing a lot of people find formencode. It is a very powerful library, but very tricky to get right.

Here’s my question — does validate really need to use formencode? Is there some better, simpler solution? I’ve read about how django tackles this problem, and their approach does seem simpler, but I can’t say for sure until I really build something with it.

If any readers can show how to make a form.clean method that does the 1-2-3 logic I described above, I’d be really grateful.

Maybe formencode just needs a fat cookbook of solutions.

How to use tg-admin sql upgrade

The tg-admin script that is bundled with turbogears is really helpful, but I had a hard time learning how to use it.

Before you read any more, you should know that this only works when you use SQLObject, not SQLAlchemy, for your ORM.

These are my notes on how I use tg-admin to upgrade an existing database.

  • I have a production database that uses prod.cfg;
  • I have a development database that uses dev.cfg;
  • Neither database has a sqlobject_db_version table initially, because I had never paid attention to it before.

The development database has a bunch of new columns, tables, and indexes that I want to add to the production database. For this example, I’ll pretend that all I want to do is add an index to a table.

First, I made sure that the dev database matches sqlobject classes:

tg-admin -c dev.cfg sql status

If those are out of sync, then do whatever you need to do to make sure your actual dev database matches your classes. Of course, tg-admin sql status is not perfect. For example, it overlooks missing indexes and constraints, at least with postgres.

Next, I recorded the state of the development database:

tg-admin -c dev.cfg sql record --force-db-version=2008-03-21

This will make a new table in the dev database called sqlobject_db_version. I am forcing it to have a value of today’s date (March 21st, 2008).

Now I connect to the production database and set a version on it with yesterday’s date:

tg-admin -c prod.cfg sql record --force-db-version=2008-03-20

Now I run this to try to upgrade the production database to match the development database:

tg-admin -c prod.cfg sql upgrade

Of course, that should fail, and I see some error message sort of like this:

$ tg-admin -c prod.cfg sql upgrade
Using database URI postgres://staffknex:staffknex@localhost/staffknex320
No way to upgrade from 2008-03-20 to 2008-03-21
(you need a 2008-03-20/upgrade_postgres_2008-03-21.sql script)

This is an example of a helpful error message. I need to write a script that will explain how to upgrade from yesterday’s version to today’s version.

That script will be really simple:

BEGIN;
CREATE UNIQUE INDEX majestic12 ON ufo_theorists (first_name, last_name);
END;

I suggest using BEGIN and END so that in case something goes wrong in the middle, your transaction will be rolled back automatically.

Now I can run this:

tg-admin -c prod.cfg sql upgrade

And my production database will be upgraded with the new index.

Now for some complaints:

  • Why isn’t this advertised better? This is a really nice feature.
  • You’re supposed to be able to specify the URI on the command line with the --connection option, but I could never get it to work.
  • I really wish that tg-admin sql status detected stuff like missing indexes and constraints. I use these things heavily.
  • It would be nice to be able to mix Python into the upgrade script, rather than just SQL. For example, I recently dropped a column that held both an employee’s first and last name and separated it into two columns. I used SQL to make the new columns, then I used Python to read data out of the old single column and write it into the two new columns, and then I used SQL again to drop the old column. There’s a rough sketch of what that looked like just after this list.
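
Here’s roughly what that hand-rolled migration looked like. The table and column names are made up for illustration, and the connection URI is a placeholder:

# Sketch of a mixed SQL + Python migration.  Names and the URI are placeholders.
from sqlobject import connectionForURI

conn = connectionForURI('postgres://user:password@localhost/mydb')

# SQL: add the two new columns.
conn.query("ALTER TABLE employee ADD COLUMN first_name text")
conn.query("ALTER TABLE employee ADD COLUMN last_name text")

# Python: split the old combined column and fill in the new ones.
for emp_id, full_name in conn.queryAll("SELECT id, full_name FROM employee"):
    parts = full_name.split(None, 1)
    first = parts[0] if parts else ''
    last = parts[1] if len(parts) > 1 else ''
    conn.query("UPDATE employee SET first_name = %s, last_name = %s WHERE id = %s"
               % (conn.sqlrepr(first), conn.sqlrepr(last), conn.sqlrepr(emp_id)))

# SQL: drop the old column.
conn.query("ALTER TABLE employee DROP COLUMN full_name")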

Like I said at the beginning, this is a really helpful script and I’m very grateful to whoever wrote it.

A few rules I try to follow with TurboGears

These are a few of the rules I try to follow in my design. So far, they’ve helped me out.

I aim to finish all interaction with the database before I get to the template layer.

This is non-trivial because it is so easy to forget that a method or an attribute will evaluate into a query. I use this rule because it lets me be certain about the number of interactions each page will have with the database.

I avoid branching (if-else clause) in my templates as much as possible.

I have a really hard time detangling code when I find a bunch of nested if statements. For all but the most trivial instances, I prefer to have a bunch of similar templates and then choose the best one. For example, instead of handling both a successful login and a failed login in a single template, I’ll make two different files and then choose the right one in my controller.

In practice, I have some really similar templates. But then I go back and strip out as much of the common code as possible and put those into widgets.

Any time I find a select() call in my controller, I consider making a new method in my model.

When I write something like this in a controller:

bluebirds = model.Bird.select(Bird.q.color == 'blue')

I usually come back later and put in something like this into the Bird class:

class Bird(SQLObject):
    color = UnicodeCol()

    @classmethod
    def by_color(cls, color):
        return cls.select(cls.q.color == color)

Now I have something that I can reuse. If I’m feeling whimsical I’ll use functools.partial to do something like this:

from functools import partial

class Bird(SQLObject):
    color = UnicodeCol()

    def by_color(self, color):
        return self.select(self.q.color == color)

    redbirds = classmethod(partial(by_color, color='red'))
    bluebirds = classmethod(partial(by_color, color='blue'))

Sidenote: I couldn’t figure out how to use the @classmethod decorator in the second version of by_color, because partial complained. Apparently, callable(some_class_method) returns False, and partial requires its first argument to be a callable.
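
Here’s a tiny snippet showing the symptom, in case it makes the problem clearer (this is just me poking at it, not an explanation):

from functools import partial


class Bird(object):

    @classmethod
    def by_color(cls, color):
        return color

    # This is what I wanted to write, but partial() raises a TypeError at
    # class-definition time, because by_color is a classmethod object here:
    # redbirds = classmethod(partial(by_color, color='red'))


print callable(Bird.__dict__['by_color'])   # False -- the raw classmethod object
print callable(Bird.by_color)               # True  -- the bound method the descriptor hands back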

Maybe a reader can explain to me what’s going on there…

A few half-formed thoughts on SQLObject

I love SQLObject, but this is a rant about the tiny frustrations I face with it.

First, this is a minor point. I don’t really care that much about database independence. Postgres has a lot of wonderful features: I never have to worry about choosing the table engine that will enforce foreign key constraints, and I like creating indexes with a function inside:

create unique index nodup_parent on category (org_id, parent_cat, lower(name));

and I really like how easy it is to write stored procedures. Anyway, since I know I’m going to use postgresql, I don’t want to be restricted to only the features that exist or can be emulated in every platform. I know all about sqlmeta and createSQL and use it plenty. But I don’t like how when I set a default value, sometimes it is set in the database table, and other times, it isn’t.

Anyway, in practice, the most dangerous part of using SQLObject is that it hypnotizes you into forgetting about the queries behind everything. Imagine you have employees, departments, and a join table between them. You can set this up in SQLObject like this:

class Employee(SQLObject):
    name = UnicodeCol(alternateID=True)
    departments = RelatedJoin('Department')

class Department(SQLObject):
    name = UnicodeCol(alternateID=True)
    employees = RelatedJoin('Employee')

You want to draw a grid that indicates whether each employee is a member of each department, so you might dash off some code like this:

for emp in Employee.select():
    for d in Department.select():
        if d in emp.departments:
            print "yes!"
        else:
            print "no!"

In an ideal scenario, you can do this with three simple queries:

  • You need a list of employees
  • You need a list of departments
  • You need the list of employee-department associations.

People that talk about how you can use outer joins to cram all that into one query will be dropped into a bottomless pit. Besides, I profiled it, and three separate queries is often much cheaper.

Anyway, back to the point. SQLObject will only run a single query to get the employees and a separate single query to get all the departments. So that’s good.

However, the place where all hell breaks loose is that if clause in the middle. If we have three employees and four departments, this statement

if d in emp.departments:

executes a dozen times. That’s unavoidable. The problem is that each time it executes, SQLObject runs a query like:

select department_id from department_employee where employee_id = (whatever);

Every time you say “is this particular department in this employee’s list of departments?” SQLObject grabs the full list of departments for that employee. So, if you ask about 10 different departments, you will run the exact same query ten times. Sure, the database is likely to cache the results of the query for you, but it is still very wasteful.

With just a few employees and a few departments, that’s not so bad. Eventually, though, as the number of employees and departments grows, the cost of that code grows as N², which is just geek slang for sucky.

So, in conclusion, this may sound like a rant, but it really isn’t. SQLObject is great. But it isn’t magic. It’s a great scaffolding system. But now I find that I’m rewriting a fair portion of my code in order to reduce the database costs.
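
Here’s a rough sketch of the kind of rewrite I mean: pull the whole join table down once and answer the membership question from a set in memory, so the grid costs three queries no matter how many employees and departments there are. The raw SQL leans on the department_employee join table shown above:

# Three queries total: employees, departments, and the whole join table.
conn = Employee._connection
memberships = set(conn.queryAll(
    "SELECT employee_id, department_id FROM department_employee"))

employees = list(Employee.select())
departments = list(Department.select())

for emp in employees:
    for d in departments:
        if (emp.id, d.id) in memberships:
            print "yes!"
        else:
            print "no!"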

Aside: when I started paying attention to the queries generated by SQLObject, I found it really useful to edit postgresql.conf and enable log_min_duration_statement. Then every query and its cost will be logged for you. This is really useful stuff. It’s helped me to relax about doing a lot of things that I used to think were really bad.

Possible bug in 1.0.4b3 tag of turbogears

The /visit/api.py file in the 1.0.4b3 tag of turbogears has this function, starting on line 177:

def encode_utf8(params):
    '''
    will recursively encode to utf-8 all values in a dictionnary
    '''
    res = dict()
    for k, v in params.items():
        if type(v) is dict:
            res[k] = encode_utf8(v)

        else:
            res[k] = v.encode('utf-8')

    return res

If you have a query string like ?a=1&a=2, then params has a key u'a' that points to a list that contains u'1' and u'2'. And .encode('utf-8') isn’t defined for lists, so . . .

Fortunately, the /visit/api.py file in the branches/1.0 branch already has a fix for this problem, so I ran setup.py develop in my checkout directory and was back in business.

I lost so much time today figuring this out because I kept looking for the bug in my code, rather than in the framework itself. Also, the code works fine as long as the query string doesn’t have more than one value for the same key.

While I’m on the soapbox, I really wish that testutil.py would change this function:

def tearDown(self):
    database.rollback_all()
    for item in self._get_soClasses():
        if isinstance(item, types.TypeType) and issubclass(item,
                sqlobject.SQLObject) and item != sqlobject.SQLObject \
                and item != InheritableSQLObject:
            item.dropTable(ifExists=True)

to something sort of like this instead:

def tearDown(self):
    database.rollback_all()
    import copy  # Probably don't actually import here, but this is just for illustration.
    x = copy.copy(self._get_soClasses())  # Store a copy of the list.
    x.reverse()  # Now reverse it.
    for item in x:  # Iterate over the reversed copy.
        if isinstance(item, types.TypeType) and issubclass(item,
                sqlobject.SQLObject) and item != sqlobject.SQLObject \
                and item != InheritableSQLObject:
            item.dropTable(ifExists=True)

The whole point of using self._get_soClasses is that it looks for a list that defines the order to follow when creating tables. You can define soClasses in your model to make sure that your independent tables are created before your dependent tables.

Well, when it comes time to destroy all your tables, you should destroy the dependent tables first.

I posted this about a month ago to the turbogears trunk mailing list already.

Sidenote — if you’re one of the people that are selflessly donating your time to working on turbogears, please don’t take my rants here personally. I’m really grateful that other people are building tools and giving them away, so that I can make a living.

MVC Blasphemy

I just put HTML code into my data model. I have a list-of-objects page. Each object is an instance of a class defined in my data model, derived from a row in a database. Each object needs a pretty link drawn to that object’s detailed-view page. So I added a property on my object:
class Message(SQLObject):

    def _get_view(self):
        "Draw a link to the view page for this message."
        return cElementTree.XML('<a href="/message/%s">VIEW</a>' % self.id)

    # Lots of other stuff snipped out.

This is now what my kid template looks like:

MESSAGE STUFF

I pass in messages and columns; messages is a list of objects and columns is a tuple of strings that map to attributes or properties, like “view”.

I’m happy with this decision. I know I could have manipulated the messages or created some new classes in my controller, but I couldn’t really see any advantage. This way works.

I just don’t want anyone else doing this 🙂

Found a possible error in chapter 7 of the TurboGears book

I bought the TurboGears book about two weeks ago, and I have been working through it. I like the book in general, but I agree with the reviewers on Amazon that complain about the number of errors. I can’t think of another programming book that I’ve read with this many errors.

All of the errors I noticed are little glitchy typographical errors, rather than incorrect theory. The authors really do a good job of illustrating the MVC approach to web design, so I’m glad I bought it.

Anyway, this page lists mistakes found after publication, and the community of readers seems to be doing a good job of helping each other out.

I think I might have found another tiny error. This code appears at the bottom of page 109:

class ProjectFields(widgets.WidgetsList):
    title = TextField(label="project", validator=validators.NotEmpty())
    client_revenue = widgets.TextField(validator=validators.Number())
    project_form = widgets.TableForm(fields=ProjectFields(), action="save_project_test")

I don’t see the point in using both TextField and widgets.TextField. But more importantly, I think the indentation is wrong in the last line. I don’t think project_form is supposed to be an attribute of the ProjectField class.

I think the code should look more like this:


class ProjectFields(widgets.WidgetsList):
    title = widgets.TextField(label="project", validator=validators.NotEmpty())
    client_revenue = widgets.TextField(validator=validators.Number())

# Moved outside the class.
project_form = widgets.TableForm(fields=ProjectFields(), action="save_project_test")

But maybe I’m missing something. I posted to the TurboGears Book mailing list, so hopefully I’ll find out.