I wrote some ruby code today

I’ve been using ditz for a few months now. It’s a bugtracking system like Trac or Bugzilla, except that it doesn’t run on a centralized server. Instead, it lives as a bunch of text files inside your SCM.

Ditz is written in Ruby, and it uses YAML files to store all the issues (tickets) and groupings of issues. Ditz allows issues to be grouped into releases.

$ ditz releases
0.6 (unreleased)
0.4 (released 2008-07-27)
0.5 (released 2008-08-20)

I wrote an extension that would show all the issues attached to a particular release, like this:

$ ditz help ri
Show issues for a particular release.
Usage: ditz ri

$ ditz ri
Error: command 'ri' requires a release

$ ditz ri bogus
Error: no release with name bogus

$ ditz ri 0.6
x ditz-61: Use text editor for multiline input where possible.
x ditz-72: add model object post-creation validation
x ditz-71: 'ditz add' shouldn't ask for comments
x ditz-76: allow configuration of whether the editor is used or not
x ditz-69: Store issues in .ditz directory by default
_ ditz-42: support tiny issue identifiers (like #34) in the single-component case
x sheila-1: check for a "git push" having updated the issue db, and reload if so
.... lots more issues snipped for brevity.

Thanks to all the really neat plumbing already built into ditz, my patch was trivial to write:

operation :ri, "Show issues for a particular release", :release do
end

def ri project, config, opts, release
  puts todo_list_for(release.issues_from(project))
end

Ruby is a pretty neat language, and people do clever stuff with symbols. In the code above, the operation method takes the symbol :ri and effectively decorates my ri method with the help text. I’m really impressed that, because the operation declaration lists :release as a parameter, ditz knew to search the list of releases and hand my method the matching release object.
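
Here’s a rough Python analogue of that mechanism, just to spell out what I think is going on. This is not how ditz is actually implemented, and every name in it is made up: the registration call records help text and an argument spec next to the function, and the dispatcher uses the spec to resolve the raw command-line string into a release object before the call.

# Not ditz's real code -- a rough Python analogue of registering a command
# with help text and an argument spec, then resolving arguments by spec.
OPERATIONS = {}

def operation(name, help_text, *arg_spec):
    def decorate(func):
        OPERATIONS[name] = {'func': func, 'help': help_text, 'spec': arg_spec}
        return func
    return decorate

def run(name, project, raw_args):
    op = OPERATIONS[name]
    if len(raw_args) < len(op['spec']):
        raise SystemExit("Error: command '%s' requires a %s"
                         % (name, op['spec'][len(raw_args)]))
    args = []
    for spec, raw in zip(op['spec'], raw_args):
        if spec == 'release':
            # The spec says this argument names a release, so look it up.
            matches = [r for r in project['releases'] if r['name'] == raw]
            if not matches:
                raise SystemExit("Error: no release with name %s" % raw)
            args.append(matches[0])
        else:
            args.append(raw)
    return op['func'](project, *args)

@operation('ri', 'Show issues for a particular release', 'release')
def ri(project, release):
    for issue in release['issues']:
        print issue

project = {'releases': [{'name': '0.6',
                         'issues': ['ditz-61: use a text editor for multiline input']}]}
run('ri', project, ['0.6'])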

I learned some neat stuff at clepy last night

Brian Beck showed how to use metaclasses and descriptors to make DSLs with Python.

I do this kind of thing every so often in my code:

def f(x):
    class C(object):
        y = x
    return C

That function takes a parameter and makes and returns a class based on that parameter. Whoop-di-do. I was surprised to learn that you can’t do this:

class C(object):
    x = 99
    class D(object):
        y = x + 1

I gotta explore this some more until it makes sense.
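
In the meantime, here’s a minimal way to watch it fail. The body of C isn’t an enclosing scope for the body of D, so the lookup for x skips C entirely, falls through to the module globals, and blows up:

# A class body is not an enclosing scope, so D's body can't see C's x;
# the name lookup falls through to the module globals and raises NameError.
try:
    class C(object):
        x = 99
        class D(object):
            y = x + 1
except NameError, e:
    print e    # name 'x' is not defined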

Here’s another neat trick. Out of the box, it isn’t possible to add two classes together:

>>> class C(object):
...     pass
...
>>> C + C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'type' and 'type'

But if you want to support this, the solution would be to define an __add__ method on the metaclass:

>>> type(C)
<type 'type'>
>>> class MC(type):
...     def __add__(self, other):
...         print 'Adding!'
...         return 99
...
>>> class C(object):
...     __metaclass__ = MC
...
>>> C + C
Adding!
99

Wacky, right? More realistically, I could build a new class by combining the attributes of both classes. In other words, if class C has a class attribute x, and class D has a class attribute y, then we can use a metaclass to add C and D together to get a new class E that has both x and y as class attributes.

In this example, C has a class attribute x and D has a class attribute y. When I add the two classes, I get a new class with both of those class attributes.

>>> C.x, D.y
(99, 98)
>>> E = C + D
>>> E.x, E.y
(99, 98)

Here’s the metaclass that allows this sort of nonsense:

class MC(type):

    def __add__(self, other):

        # Start with a subclass of the left-hand class, so the new class
        # inherits everything the left-hand class already has.
        class E(self):
            pass

        # Then copy the right-hand class's attributes onto it.
        for k, v in other.__dict__.items():
            if k not in ('__dict__', ):
                setattr(E, k, v)

        return E

Break up changes into different commits with git add -p

This guy’s post led to this one.

I’m irresponsible about committing after each conceptual unit of work. Lots of times, I’ll edit a file to fix one bug, then while I’m in there, I’ll edit some other code because I see a better way to do something else. Then maybe I’ll add a few doctests to a completely different section because I feel like it.

After a few hours, I have edits in a single file that are related to multiple separate tasks. So back when I used svn, I would commit it with a message like “Fixed topics A, B, C”. Or I would say “Fixed A and a bunch of other stuff”.

Now with git, before I commit my changes, I run:

$ git add -p frob.py

Then git opens up an interactive session that walks through all the changes in that file and asks me whether I want to stage each one. It is also possible to look at every change across the whole repository: just leave off the filename.

In the first pass, I stage all the hunks related to the first issue. Then I commit those changes. Then I repeat the process and stage the hunks related to the next issue.

Keep in mind that I committed my changes after the first pass, so when I go through the file the second time, I won’t get prompted for those changes.

A real-world example

I’ve got two edits in mkinstall.py. One is a change to the list of files I want to ignore, and the other edit is a silly stylistic change. I want to commit them separately.

$ git diff mkinstall.py
diff --git a/mkinstall.py b/mkinstall.py
index 4c6bb4e..2cb43de 100644
--- a/mkinstall.py
+++ b/mkinstall.py
@@ -17,7 +17,8 @@ Otherwise, I'll add a symlink.
import os, shutil

# Anything you want to skip:
-skip_us = ["mkinstall.py", ".svn", "_vimrc"]
+skip_us = ["mkinstall.py", ".svn", "_vimrc", "diffwrap.sh", "lib",
+ "lynx_bookmarks.html", "ipythonrc-matt", ".git"]

# Anything you want to copy rather than symlink to:
copy_us = [".vim"]
@@ -60,7 +61,8 @@ for thing in copy_us:
if os.path.islink(homefile):
print "A symbolic link to %s exists already, so I'm not going to copy over it." % homefile

- elif os.path.exists(homefile): continue
+ elif os.path.exists(homefile):
+ continue

else:
svnfile = os.path.join(svnpath, thing)

This is what happens when I run git add -p:

$ git add -p mkinstall.py
diff --git a/mkinstall.py b/mkinstall.py
index 4c6bb4e..2cb43de 100644
--- a/mkinstall.py
+++ b/mkinstall.py
@@ -17,7 +17,8 @@ Otherwise, I'll add a symlink.
import os, shutil

# Anything you want to skip:
-skip_us = ["mkinstall.py", ".svn", "_vimrc"]
+skip_us = ["mkinstall.py", ".svn", "_vimrc", "diffwrap.sh", "lib",
+ "lynx_bookmarks.html", "ipythonrc-matt", ".git"]

# Anything you want to copy rather than symlink to:
copy_us = [".vim"]
Stage this hunk [y/n/a/d/j/J/?]?

At this point, I will hit y. Now that section of the file is staged to be committed. That is not the same as committing it.

Now git shows the next section of code that is different:

Stage this hunk [y/n/a/d/j/J/?]? y
@@ -60,7 +61,8 @@ for thing in copy_us:
if os.path.islink(homefile):
print "A symbolic link to %s exists already, so I'm not going to copy over it." % homefile

- elif os.path.exists(homefile): continue
+ elif os.path.exists(homefile):
+ continue

else:
svnfile = os.path.join(svnpath, thing)
Stage this hunk [y/n/a/d/K/?]?

I don’t want to stage this right now, so I hit n. That’s the last edit in the file, so the interactive session completes. Now when I run git diff --cached, which tells me what is about to be committed, look what I see:

$ git diff --cached mkinstall.py
diff --git a/mkinstall.py b/mkinstall.py
index 4c6bb4e..a348ee1 100644
--- a/mkinstall.py
+++ b/mkinstall.py
@@ -17,7 +17,8 @@ Otherwise, I'll add a symlink.
import os, shutil

# Anything you want to skip:
-skip_us = ["mkinstall.py", ".svn", "_vimrc"]
+skip_us = ["mkinstall.py", ".svn", "_vimrc", "diffwrap.sh", "lib",
+ "lynx_bookmarks.html", "ipythonrc-matt", ".git"]

# Anything you want to copy rather than symlink to:
copy_us = [".vim"]

So now I’ll commit this edit with an appropriate remark:

$ git commit -m "Added some more files to the list of files to be skipped"
Created commit ce0478d: Added some more files to the list of files to be skipped
1 files changed, 2 insertions(+), 1 deletions(-)

Now I’ll view the unstaged changes again in my file, and notice that the other change still remains:

$ git diff mkinstall.py
diff --git a/mkinstall.py b/mkinstall.py
index a348ee1..2cb43de 100644
--- a/mkinstall.py
+++ b/mkinstall.py
@@ -61,7 +61,8 @@ for thing in copy_us:
if os.path.islink(homefile):
print "A symbolic link to %s exists already, so I'm not going to copy over it." % homefile

- elif os.path.exists(homefile): continue
+ elif os.path.exists(homefile):
+ continue

else:
svnfile = os.path.join(svnpath, thing)

At this point, I can rerun git add -p and stage up more stuff to be committed. In this case, it is more realistic that I would run

$ git commit -a -m "Made a silly style change"

That will stage and commit that last edit in one swoop.

Sometimes I think validate + formencode is more hassle than it is worth

I’m hoping somebody will read this and show me a better way.

In general, I like formencode. I like that I can do stuff like:

@validate(validator=SomeGnarlySchema())
def m(self, a, b, c, d, e=None):

And then I know that all my parameters have been converted from their original string values into whatever I want.

But I also find that I spend a lot of time getting my complex schemas to work. Right now, for example, I have an optional parameter e, which should be either a string representing a date or None.

I’ve got a validator with this logic in it for e:

  1. First try to return a datetime.date object from parsing e.
  2. Otherwise, look in the cookie for a key “e” and try to return that after parsing it into a datetime.date.
  3. Finally, just return today’s date.

So, the idea is that some visitor can come to page /m and always see data for today. Or, they can use a calendar widget to choose a value. On subsequent visits back to /m, I’ll keep showing them that same date they chose because I saved it in a cookie.
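
Just to make the goal concrete, here’s roughly what I’d like that single-field validator to look like. This is only a sketch: I’m pretending the cookie value rides in on the state object as e_cookie, and the date format is whatever I happen to be using.

import datetime

import formencode

class DateOrCookieOrToday(formencode.FancyValidator):

    def _to_python(self, value, state):
        # 1. Try to parse the submitted value into a date.
        try:
            return datetime.datetime.strptime(value, '%Y-%m-%d').date()
        except (TypeError, ValueError):
            pass
        # 2. Fall back to whatever date is stored in the cookie (assuming
        #    the controller copied it onto the state object).
        cookie_value = getattr(state, 'e_cookie', None)
        if cookie_value:
            try:
                return datetime.datetime.strptime(cookie_value, '%Y-%m-%d').date()
            except ValueError:
                pass
        # 3. Finally, just return today's date.
        return datetime.date.today()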

Here’s the problem. I have to make e an optional parameter because I don’t want to require that people hit the site with a url that contains a value for e.

However, when e is None, then my validator for e is ignored! So, as far as I know, at this point, I need to use a validator that operates on the whole set of parameters.

Which is also possible, but in my brain, it seems wrong that I have to use a schema-level validator when I really am only validating one single field.

More generally, anybody that subscribes to the formencode mailing list sees first-hand just how confusing a lot of people find formencode. It is a very powerful library, but very tricky to get right.

Here’s my question — does validate really need to use formencode? Is there some better, simpler solution? I’ve read about how django tackles this problem, and their approach does seem simpler, but I can’t say for sure until I really build something with it.

If any readers can show how to make a form.clean method that does the 1-2-3 logic I described above, I’d be really grateful.
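
My best guess at the shape of it is something like the following, but I haven’t actually built it, so treat it as a sketch. The form and field names are invented, and the view would have to hand the cookie value to the form, since a Django form never sees the request on its own.

import datetime

from django import forms

class PickADateForm(forms.Form):

    e = forms.DateField(required=False)

    def __init__(self, data=None, cookie_date=None, **kwargs):
        # The view digs the date string out of the cookie and passes it in.
        super(PickADateForm, self).__init__(data, **kwargs)
        self.cookie_date = cookie_date

    def clean_e(self):
        # 1. Use the submitted value if the DateField already parsed one.
        value = self.cleaned_data.get('e')
        if value:
            return value
        # 2. Otherwise fall back to the date stored in the cookie.
        if self.cookie_date:
            try:
                return datetime.datetime.strptime(self.cookie_date, '%Y-%m-%d').date()
            except ValueError:
                pass
        # 3. Finally, just use today's date.
        return datetime.date.today()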

Maybe formencode just needs a fat cookbook of solutions.

Some research on generic/EAV tables

Yesterday I confirmed a hunch I’ve had about database schema design. Here’s the background: I’m working on a feature where I track employees and their preferred locations, shifts, and stations.

For example, I’ll track that Alice likes the morning shift at the west-side location, and she likes to work the front register station most of all, but her second choice is the drive-through.

Meanwhile, Bob likes the west-side and north-side locations, is indifferent about the shift, and likes the dishwasher station. Note the one-to-many relationship between Bob and his preferred locations and his lack of shift preferences.

I came up with two ways to make my tables:

FIRST METHOD

create table preferred_location (
    employee_id int references employee (id),
    location_id int references location (id));

create table preferred_shift (
    employee_id int references employee (id),
    shift_id int references shift (id));

create table preferred_station (
    employee_id int references employee (id),
    station_id int references station (id));

Hopefully, this is obvious. I store that Alice likes the west-side location in the preferred_location table like this:

(Alice's ID, west-side location ID)

Then I store the fact that she likes the morning shift in the preferred_shift table like this:

(Alice's ID, morning shift ID)

Every time I want to add some new type of preference, e.g., hats, I need to make a table to hold all the legal hats and then make a table linking employees to their hat preference.

SECOND METHOD

This way keeps all the preferences in a single table.

create table preferences (
    employee_id int references employee (id),
    preference_type text,
    preference_value text);

Here’s how I would store that Bob likes to be a dishwasher:

(Bob's ID, 'station', 'dishwasher')

Here’s what I like about this method two: I don’t need to tweak the database schema whatsoever when I dream up new preferences. In fact, I can let system users create new preference types at run-time, and the system just works. In this scenario, adding each employee’s hat preference does not require updating my schema.

On the downside, I wouldn’t have any FK constraints. Somebody could store a preference for a nonexistent shift, and I wouldn’t know until I got an angry customer calling me. I’d have to do a lot of application-level data validation, which I hate.

Finally, there’s just something about method two that seems … wrong, even though I’ve seen variations of this theme in production environments at previous jobs (cough, ALLCODES, cough, PINDATA, cough).

So, with this dilemma, I wrote a post to the PostgreSQL users mailing list and got a fantastic reply. Here are some excerpts:

Your “method 2” is something called an Entity-Attribute-Value table design[1].

That said, by going the EAV/”Method-2” route, you’re gaining flexibility, but at the cost of increased complication, and ultimately repurposing a relational database to do something that isn’t very database-like, that’s really more like a spreadsheet. (So why not just use a spreadsheet?) You have little room for recording additional information, like ordering preferences, or indicating that (say) a station preference depends on a location preference, or that a shift time depends on day of the week, etc — so you’re probably not getting as much flexibility as you think. Sure, you could add an “Extra_Data” column, so you have rows:

Marie-Location-West-1,
Marie-Location-East-2,
Marie-Shift-Evening-Tuesday,
Marie-Station-Register-West,
Marie-Shift-Morning-Sunday,

etc. But you can see the data integrity nightmare already, when you somehow manage to record “Marie-Shift-Register-1”. Not to mention that you’ll have to do format conversions for that “Extra_Data” field, and incorporate logic somewhere else in your program that deciphers whatever’s in the generic data field to come up with ordering preferences for locations, station preferences by shift times, or whatever else you want to store.

[1] http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

At this point, I was pretty sure I would go with method 1, but not absolutely certain. Then I read that linked article, which really just said more of the same.

Then I read this Ask Tom post and that erased the last bit of lingering doubt I had. Method 2 is incompatible with performance. Method 2 turns your database into a glorified flat file. Here are some of my favorite excerpts from the Ask Tom post:

Frequently I see applications built on a generic data model for “maximum flexibility” or applications built in ways that prohibit performance. Many times – these are one in the same thing! For example, it is well known you can represent any object in a database using just four tables:

Create table objects ( oid int primary key, name varchar2(255) );

Create table attributes
( attrId int primary key, attrName varchar2(255),
  datatype varchar2(25) );

Create table object_Attributes
( oid int, attrId int, value varchar2(4000),
  primary key(oid,attrId) );

Create table Links ( oid1 int, oid2 int,
  primary key (oid1, oid2) );

Looks great, right? I mean, the developers don’t have to create tables anymore, we can add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The developers can do whatever they want and the DBA can’t stop them. This is ultimate “flexibility”. I’ve seen people try to build entire systems on this model.

But, how does it perform? Miserably, terribly, horribly. A simple “select first_name, last_name from person” query is transformed into a 3-table join with aggregates and all.

There’s a comment on that story about some java developers that insisted on this approach and then had to redesign the whole thing post-launch. I also like the “chief big table” remark.

Anyhow, it’s nice to know that (this time) my instincts were correct.

The high-level view is worthwhile

About two weeks ago, I wrote a provisional patent application.

According to our attorney, a patent application needs to describe the product with enough detail so that anyone skilled in the trade would be able to follow the instructions and build the product.

So I didn’t include a diagram of my database. I figure that somebody skilled in the trade would understand how to design an adequate database schema when I say stuff like “since the system supports sending the same information to many people, where some people receive SMS message, and others get a voice call using text-to-speech, I keep the message data separate from the details about who receives that message and by what means.”

Likewise, when I say the system can forward an incoming SMS message from an employee to a supervisor’s email address, I don’t go through the details of how I parse the binary crapola from the SMS and then construct an email message. Somebody skilled in the trade knows how to do that already, or they know how to learn how to do it.

And I didn’t talk about which web application framework I used or how exactly I deploy my code on the production server. I figure that a skilled programmer can figure out those details without my help. Furthermore, there’s probably a lot of different possible solutions.

So I left a lot of detail out. I focused on how the system responds to certain events and what problems the system was designed to solve.

Even after taking all those shortcuts, I ended up with nearly 10 pages of text and it took nearly an entire week to crank out. After I finished, I learned a few things:

  1. My code follows patterns. I build one feature, then I build another feature using the same style. By the time I’m writing the third feature, again using the same style, similar code now lives in three places.
  2. The application lacks symmetry. There are obvious examples, like where you can download data into a spreadsheet from one screen but not from another. In less-obvious cases, two different methods might solve the same problem, but one method uses a superior technique.

The second point is the opposite of the first point. Instead of solving different problems in similar ways, I’m solving similar problems in different ways.

I need more abstraction. I need to write code that is easier to reuse, and then actually reuse it rather than dashing off some ad-hoc fix.

All of this happened because of my design strategy. I talk to a pool of customers and then I make a list of their problems. Then I sort those problems by (A * B) / C, where:

A = size of the market with this problem
B = how severe it is (i.e., how much would they pay for the solution)
C = how difficult it is to solve.

I pick enough of the best candidates to fill up about 30 days of work. Then I build, test, and deploy, and the cycle starts over again.
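
As a toy version of that sort (all the numbers here are invented):

# Score each candidate problem by (market size * severity) / difficulty
# and work on the highest-scoring ones first.  The numbers are invented.
problems = [
    # (description, market size, severity, difficulty)
    ('export schedules to a spreadsheet', 40, 3, 2),
    ('forward SMS messages to a supervisor', 25, 5, 8),
    ('track hat preferences', 5, 1, 1),
]

def score(problem):
    description, a, b, c = problem
    return float(a * b) / c

for problem in sorted(problems, key=score, reverse=True):
    print '%6.1f  %s' % (score(problem), problem[0])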

So far, I’m happy with working this way, but there’s clearly a downside to only focusing on defining and building the next feature. A colleague calls it “not seeing the forest for the trees.” I think that’s about right. I don’t think of my work as a gestalt as much as a bag of magic tricks. Anyhow, for a few brief moments, after taking a week off from writing code to write that patent application, I saw the forest. It’s a nice view.

A worse blogging system

I’ve been daydreaming about this for a while. I took some time to write out my thoughts. They’re still half-baked.

Blogs and RSS feeds are pretty good. I don’t have to manually go to sites. My reader polls the sites I subscribe to and it pulls the feeds. But the situation could be a lot better.

Problems with blogging from the reader’s POV

Feed readers don’t work all that well offline. Sure, maybe the RSS feed itself is downloaded, but images won’t likely be pulled down.

Also, polling is kind of goofy. It would be nicer to use some kind of pub-sub framework where I get notified.

RSS feeds usually only store recent stories.

Very often I find a great blog that has dozens of stories. I would love to be able to download the entire blog for offline viewing.

What about Google Gears?

Yeah, what about it? I know of one single blog that actually uses it in this context. I would like to think there is a solution to this problem that doesn’t require building C++ extensions to the browser.

Problems from the writer’s POV

This section is based on my experiences with WordPress and Blogger. Obviously, publishing content on a remote site requires an internet connection to that remote site, but there is no real reason that I should need an internet connection to preview the rendering of my content.

Also, there’s no obvious way I can integrate my source control tools with my blog engine.

Several times I’ve started an article on my laptop, uploaded it as a draft to my server, worked on it there, then lost my internet connection and had to go back to an out-of-date draft on my laptop to continue work.

I can write an article much more quickly using simplified markup and I can be pretty certain that it will render into valid HTML. There are a few plugins for WordPress that support writing with markdown, but they require using the WordPress text editor. Sure, I could copy and paste from my real editor, but that’s less than ideal.

The idea

Take these ingredients:

  • Any decentralized source control system.
  • Any simplified markup language, like reStructuredText, markdown, or textile
  • Any tool to make pretty html out of that markup language.

And optionally:

  • A new tool to build lots of index files and RSS feeds.
  • A new tool to notify interested parties that something new is ready, by email, jabber, pingback, etc.

Here’s a simple example:

  1. I write a text file using reStructuredText.
  2. I use a local git repo to track revisions.
  3. I use a local tool to render my text file into HTML and make sure I’m happy with the look. Git is set to ignore these HTML files.
  4. When I’m done, I use git to push my work to a remote repository on a box with a webserver.
  5. That repository has some code that fires whenever it receives a new push:
    • It runs the exact same HTML rendering programs I used locally.
    • It builds a new RSS feed.
    • It rebuilds any internal indexes, tables of contents, or whatever else is appropriate.
    • It interacts with whatever pub-sub crap is useful so other people learn about the new content.

On the remote git repository, all the rendered HTML, RSS, etc would be available for cloning and the webserver supports people reading my blog the old-fashioned way.
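
Step 5 is really just a git hook. Here’s a rough sketch of what a post-receive hook could look like for the simple example above, assuming a bare repository with a separate checkout for the text files; the paths are made up, and rst2html.py stands in for whatever renderer I end up using.

#!/usr/bin/env python
# Sketch of a post-receive hook: update the checkout, re-render every
# text file, and leave room for feeds/indexes/notifications at the end.
import os
import subprocess

WORK_TREE = '/srv/blog/source'   # checkout of the pushed text files
HTML_DIR = '/srv/blog/html'      # what the webserver actually serves

def main():
    # Bring the checkout up to date with whatever was just pushed.
    subprocess.call(['git', '--work-tree=' + WORK_TREE, 'checkout', '-f'])

    # Run the same renderer I use locally, so the published pages match
    # the preview on my laptop.
    for name in os.listdir(WORK_TREE):
        if name.endswith('.rst'):
            source = os.path.join(WORK_TREE, name)
            dest = os.path.join(HTML_DIR, name[:-len('.rst')] + '.html')
            subprocess.call(['rst2html.py', source, dest])

    # Rebuilding the RSS feed, the indexes, and the pub-sub notifications
    # would hang off the end of this same hook.

if __name__ == '__main__':
    main()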

WordPress has other features like being able to navigate through archives, or select stories by tags, or send updates to twitter, etc. I think all of these could be solved somehow during the publishing phase.

For example, navigation through archives doesn’t really require any scripting. I just need to generate indexes for every date range.

Tag-based navigation also doesn’t really require running a query like this:

SELECT POSTS.*
FROM POSTS, POST_TAGS, TAGS
WHERE POSTS.ID = POST_TAGS.POST_ID
AND POST_TAGS.TAG_ID = TAGS.ID
AND TAGS.NAME = 'some inoffensive tag name';

It would be sufficient to just regenerate indexes for every tag after each post during the publishing phase.
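
Here’s a toy version of the idea; the post metadata and the output paths are invented:

# Regenerate one static index page per tag at publish time.  The post
# metadata and the tags/ output directory are invented for illustration.
import os
from collections import defaultdict

posts = [
    {'title': 'Why rinsing is as good as washing',
     'url': '/why-rinsing-is-as-good-as-washing.html',
     'tags': ['hygiene', 'laziness']},
    {'title': 'Soap is not optional',
     'url': '/soap-is-not-optional.html',
     'tags': ['hygiene']},
]

posts_by_tag = defaultdict(list)
for post in posts:
    for tag in post['tags']:
        posts_by_tag[tag].append(post)

if not os.path.isdir('tags'):
    os.makedirs('tags')

for tag, tagged_posts in posts_by_tag.items():
    f = open('tags/%s.html' % tag, 'w')
    f.write('<h1>Posts tagged %s</h1>\n' % tag)
    for post in tagged_posts:
        f.write('<p><a href="%s">%s</a></p>\n' % (post['url'], post['title']))
    f.close()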

What about comments?

WordPress allows visitors to post comments on a blog, and it does a pretty good job filtering out spammers with the Akismet plugin. I see two solutions; one is straightforward and mediocre and one is preposterous.

The straightforward solution is to use a service like disqus to track comments on an external server.

The rendered HTML pages would include a blob of javascript. That javascript makes a request to the comment service, pulls down all the comments for this URL, and appends the text to the DOM. Of course, people that download the material for offline viewing won’t see the comments when they don’t have an internet connection.

Sure, it would be possible to regularly scrape the comments out of the remote server and rebuild all the files available for offline viewing, but that only solves the reading part.

Copyright issues with comments

Imagine I write a blog post with a mediocre code sample inside, and you think of a better way to write the same code.

You start writing a comment on my site (or on my Disqus section, it doesn’t matter) and you’re about to submit, when you see a little line that says all comments become my copyright, and you know you want to use this code in some GPL project.

Maybe you don’t see any lines at all that explain who owns blog comments, so then you’re uncertain about what applies.

Anyhow, there’s a deadweight loss here. You have something to say that would help me out, but you won’t say it. If I knew what you were going to say, I’d make a special exception just for this one comment.

By the way, if you want me to change my license so I don’t own the comments, then I’m faced with a bad situation where somebody can post a comment, and then demand later that I take it down. This is a serious problem for “real” sites. Look at the terms of service on reddit. It insists on a perpetual non-exclusive right to any content posted there.

The ridiculous solution

Just like it will be possible to clone my blog text, commenters should have their own repository where I can clone their comments.

So, when Lindsey comments on my (Matt’s) site, she really writes a post on her own site, and then sends my site a message that says:

Hi Matt,

I read your blog post [1] and I wrote a comment here on my site [2].

You can show my comment on your site as long as you agree with my comment license [3].

[1] http://matt.example.com/why-rinsing-is-as-good-as-washing

[2] http://lindsey.example.com/soap-is-not-optional

[3] http://lindsey.example.com/comment-license

Lindsey

This message could be an email, an HTTP post, whatever. I could manually process this message, or I could set up some handler that figures out what to do based on some rules ahead of time.
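
Sketching the handler half of that idea; the message format, the whitelist, and the license list are all invented:

# Decide what to do with an incoming comment notification, based on rules
# set up ahead of time.  The message format and both lists are invented.
TRUSTED_COMMENTERS = [
    'http://lindsey.example.com/',
]

ACCEPTABLE_LICENSES = [
    'http://lindsey.example.com/comment-license',
]

def handle_comment_notification(message):
    """message holds the URLs for my post, the comment, and its license."""
    if any(message['comment'].startswith(site) for site in TRUSTED_COMMENTERS):
        return 'pull'        # fetch the comment and show it on my site
    if message['license'] in ACCEPTABLE_LICENSES:
        return 'moderate'    # hold it for me to look at manually
    return 'ignore'

notification = {
    'post': 'http://matt.example.com/why-rinsing-is-as-good-as-washing',
    'comment': 'http://lindsey.example.com/soap-is-not-optional',
    'license': 'http://lindsey.example.com/comment-license',
}
print handle_comment_notification(notification)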

So, we’ve changed the flow of comments from lots of people pushing text to me to a system where they just send me notifications and if I want to pull them, then I can.

This system allows more offline work to be done. Lindsey can clone my site and read it. Then she can write a comment. The next time she has an internet connection, she publishes her comment to her site, which triggers the message to be sent to my site.

Conversation hubs

So, pretend that I don’t show Lindsey’s comment on my site because I think her point makes me look stupid. Now how do third parties get to see her remarks?

Well, here’s a solution that is better than the status quo. Imagine that when Lindsey sent me a message about her comment, she also sent a similar message to another server called a conversation hub.

She tells that hub that her post http://lindsey.example.com/soap-is-not-optional is a response to my post http://matt.example.com/why-rinsing-is-as-good-as-washing.

When somebody clones a feed from my site, they can also check a few of these conversation hubs and optionally clone any posts that have been registered as responses to my posts.

We’d need better tools to assemble a conversation thread from all the different pieces. But that’s not really that hard.

What about spamming the conversation hub?

A spammer could just send messages to the conversation hubs linking their posts to everything out there.

Well, the conversation hubs could insist on real authentication, and then allow feedback from people. Also, people that check for comments at a hub can request to only see comments that have received aggregate positive feedback.

What about Adsense?

Well, if I switch to this approach, and people start downloading my text files to read offline, they ain’t gonna see my adsense ads, and I’ll be deprived of my $15/year revenue.

But for people that actually make real money off adsense, the question is valid. Remember that we’re talking about helping people read your site offline. Those people that are mostly offline aren’t seeing the site now anyway.

The online visitors can still see them though. Also, people that view the HTML files after cloning my publish node may still see them if they have a working internet connection and they allow the embedded javascript to run.

Sure, there’s a risk that some online viewers will switch to the offline-views and then turn off javascript or their internet connection so that they can’t see the ads.

Publishers would need to weigh this risk. Maybe the solution could be to sell offline copies at a price equal to the expected lost revenue from the switchers.

What about SEO?

It’s a non-issue. The HTML is available online just like it always was.