Git bundle converts your whole repository into a single file kind of like webpack

Posted on December 11, 2018 by matt

WHAT

Pretend you just spent a few minutes, hours, or days trying something out, and now you want to get the project off your janky laptop’s hard drive that you just know is gonna die soon.

You’ve been tracking work with git locally because it is trivial to set up:

$ cd myproject
$ vi README # pretend this is your brilliant code.
$ git init
$ git add *
$ git commit -a -m "Let's get this party started"

Now you want a single file that has your whole project and all the commits you’ve made.

You can use git bundle for this! Here is how:

$ cd myproject
$ git bundle create myproject.bundle --all
$ scp myproject.bundle example.com:/tmp/

If it helps, you can think of git bundle as kind of like tar or zip or even webpack. Those are all things that convert a big tree of stuff and spit out a single doodad.

HOW

Here is how to make a single file with everything from all branches:

$ git bundle create myproject.bundle --all

Or you can make a single file (a bundle) that has only the master branch:

$ git bundle create myproject.bundle master

Or make one just with whatever branch you’re working in now:

$ git bundle create myproject.HEAD.bundle HEAD

Now move the bundle to a remote box via scp or rsync or whatever other method you want.

You might ask why you would pick rsync over scp, since they both copy a file over a secure tunnel. The main advantage of rsync is that it checks whether the file needs to be copied again:

$ rsync -e ssh --verbose myproject.bundle example.com:/tmp/myproject.bundle
sent 2,793 bytes  received 35 bytes  377.07 bytes/sec
total size is 2,705  speedup is 0.96

$ rsync -e ssh --verbose myproject.bundle example.com:/tmp/myproject.bundle
sent 100 bytes  received 59 bytes  16.74 bytes/sec
total size is 2,705  speedup is 17.01

See how the second time I ran rsync, it only sent 100 bytes? That’s because it tested if the version of myproject.bundle on example.com was out of sync with the one here. That can really, really help when you’re on a slow connection or working with big files.

Here is how to make a new repo based on that bundle:

$ ssh example.com
$ git clone -b master /tmp/myproject.bundle myproject2
$ cd myproject2

Pretty fresh, right?

Also, the list-heads command is pretty useful for spying on what is inside a bundle file:

$ git bundle create myproject.all-branches.bundle --all
$ git bundle list-heads myproject.all-branches.bundle
5702b7e5d8dd16839850e3fbad44ee69a9411586 refs/heads/master
82a0cd0d59b4929df8ff439cede8a33bbf850cfe refs/heads/more-docs
5702b7e5d8dd16839850e3fbad44ee69a9411586 HEAD

$ git bundle create myproject.master.bundle master
$ git bundle list-heads myproject.master.bundle
5702b7e5d8dd16839850e3fbad44ee69a9411586 refs/heads/master

$ git bundle create myproject.HEAD.bundle HEAD
$ git bundle list-heads myproject.HEAD.bundle
5702b7e5d8dd16839850e3fbad44ee69a9411586 HEAD

Unless you use --all, you won’t get all your branches in your bundle! Sometimes, that’s exactly what you want. But for rookies, usually, you’re just trying to ship everything.
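If you want a script to double-check which branches made it into a bundle, the list-heads output is easy to pick apart. Here is a rough Python sketch. The function name parse_list_heads is made up for this example, and it assumes every line is a commit hash, a space, and a ref name, like the output shown above.

```python
def parse_list_heads(output):
    """Map each ref name to its commit hash.

    Assumes every non-blank line looks like "<sha> <ref>", which is the
    shape of the `git bundle list-heads` output shown in this post.
    """
    refs = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, ref = line.split(None, 1)
        refs[ref] = sha
    return refs


# Sample text copied from the --all example in this post.
sample = """\
5702b7e5d8dd16839850e3fbad44ee69a9411586 refs/heads/master
82a0cd0d59b4929df8ff439cede8a33bbf850cfe refs/heads/more-docs
5702b7e5d8dd16839850e3fbad44ee69a9411586 HEAD
"""

heads = parse_list_heads(sample)
branches = sorted(ref for ref in heads if ref.startswith("refs/heads/"))
print(branches)
```

If a branch you expected is missing from that list, you probably forgot --all.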

WHY

First of all, you can’t beat how easy it is to make a bundle and ship it:

$ git bundle create myproject.bundle --all
$ scp myproject.bundle example.com:/tmp/

Second, sure, usually, I would make a new repository on some hosted service like github or bitbucket or gitlab. And I might also make a private repository on a box I rent from Linode (that link has my referral code) or AWS EC2 or Digital Ocean.

But maybe I’m in a coffeeshop with slow wifi, and my friend is sitting right next to me, and I want to share the code with him or her, and it seems crazy for us both to communicate by sending packets around the world.

Also, using git bundle vs pushing to a remote repository ain’t an either-or thing!

There is nothing wrong with setting up a few cron jobs to run git bundle to create some bundle files and shove them to AWS S3 or dropbox or wherever, even though you’re still paying that exorbitant github bill.
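For the cron job idea, something like this Python sketch might be a starting point. The function names, the dated file name pattern, and the paths are all assumptions for illustration. It shells out to git bundle create with --all, then runs git bundle verify as a sanity check.

```python
import subprocess
from datetime import date


def bundle_name(project, day):
    """Build a dated file name such as myproject.2018-12-11.bundle."""
    return f"{project}.{day.isoformat()}.bundle"


def make_bundle(repo_dir, project, day=None):
    """Run `git bundle create <file> --all` inside repo_dir, then verify it."""
    name = bundle_name(project, day or date.today())
    subprocess.run(["git", "bundle", "create", name, "--all"], cwd=repo_dir, check=True)
    # `git bundle verify` checks that the file is a valid bundle for this repo.
    subprocess.run(["git", "bundle", "verify", name], cwd=repo_dir, check=True)
    return name


# Only the naming helper runs here; make_bundle needs a real repository.
print(bundle_name("myproject", date(2018, 12, 11)))
```

A cron entry would call make_bundle with your own repository path and then hand the resulting file to scp, rsync, or whatever uploader you like.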

Grab all the rows when parameter is null, or just grab the rows that match

Posted on April 22, 2018 by matt

If I pass in NULL for xyz, I want to get all the rows.

If I set xyz to a value, I only want the rows in the table where the xyz column matches the value I pass in.

And I don’t want to build up a string in my app code.

Here is how:

select *
from my_table
where case
    when %(xyz)s is null then true
    when %(xyz)s = xyz then true
    else false
end
;

And here’s a slight tweak if you want to pass in an array of allowed values:

select *
from my_table
where case
    when %(xyz)s is null then true
    when xyz = any(%(xyz)s) then true
    else false
end
;
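If you want to try the single-value version without a Postgres server handy, here is a small sketch using Python's built-in sqlite3 module. Note the assumptions: it uses sqlite's :xyz placeholder instead of the %(xyz)s style above, 1 and 0 instead of true and false, and a made-up three-row table. The array variant is not covered, since sqlite has no any().

```python
import sqlite3

QUERY = """
select xyz
from my_table
where case
    when :xyz is null then 1
    when xyz = :xyz then 1
    else 0
end
order by xyz
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (xyz text)")
conn.executemany("insert into my_table (xyz) values (?)", [("a",), ("b",), ("c",)])

# Passing None (NULL) returns every row.
all_rows = [row[0] for row in conn.execute(QUERY, {"xyz": None})]

# Passing a value returns only the matching rows.
b_rows = [row[0] for row in conn.execute(QUERY, {"xyz": "b"})]

print(all_rows)
print(b_rows)
```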

I hope this helps! If you know a better way, let me know!

Meaningful primary keys beat sequences and UUIDs

Posted on March 18, 2018 by matt

Summary

Instead of serial primary keys or UUIDs for primary keys, when possible, you should make a PK that somehow describes the underlying data. This is not as hard as you might think!

In detail

Everything you read about databases will say that every table really ought to have a primary key, in other words, some value that is unique in every row. The clumsiest, crudest, simplest way to do that is to use something like a sequence (or a row number) for that primary key.

Here’s a table where I track schedules at golf courses:
create table loops -- a "loop" is what people call a game of golf
(
    loop_number serial primary key,
    club_short_name text not null references clubs (short_name),
    tee_time date not null default current_date,
    golfer integer not null references golfers (golfer_id),
    number_holes integer not null check (number_holes in (9, 18)),
    unique (club_short_name, golfer, tee_time, number_holes) -- prevent duplicates!
);

Every time somebody inserts a row in that loops table, the loop_number column will automatically get a unique number. Now if we’re making some kind of web app, it is easy to make a link to a particular game (aka loop) by just making a URL with the loop_number in there, like this:
https://example.com/loop?loop_number=99
or this:
https://example.com/loop/99

Depending on how fancy you like your URLs.

This style was super-popular in the Ruby on Rails heyday.

Then maybe a few years after that, when distributed databases started catching on, it wasn’t as easy to get the next sequential value, because you couldn’t check all the nodes quickly. So people started using stuff like UUIDs like this:
row_id uuid not null default uuid_generate_v4() primary key
instead of
row_number serial primary key

And then URLs started looking more like this:
https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

Sidenote

I’ve heard a lot of people argue that replacing sequential primary keys with UUIDs is somehow more secure because it is very easy for some malicious person to change
https://example.com/loop/99
to something like
https://example.com/loop/100
and possibly spy on information not meant for them. Doing the same trick with UUIDs is not so easy; in other words, if you hand out a URL like

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

to some customer, they probably won’t find any valid rows by just slightly incrementing that UUID (and that’s if they can figure out HOW to increment it).

I personally don’t think that switching from serial PKs to UUIDs is always enough to block this attack, but it is absolutely a great first step! In practice, it seems pretty hard to guess another valid UUID, but it certainly is not impossible, and the RFC for UUIDs has this little nugget of advice:

Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will
exacerbate the situation.

Incidentally, this URL tweaking is a big source of data breaches! I’ve seen it in the wild numerous times. I found this post that describes this kind of attack in more detail.

Unfortunately, the popular web frameworks don’t offer much help for this issue. They all make it easy to check that a user is authenticated (they are who they say they are), and maybe they offer some kind of role-based permissions, but you’re pretty much on your own when building a multi-tenant system with data privacy.

In other words, if you want to block user X from spying on data that should only be seen by user Y, you need to check that yourself, in all the different places in your code where you pull back data.
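Here is roughly what that check can look like, as a minimal Python sketch. It assumes a cut-down loops table in an in-memory sqlite database and a hypothetical fetch_loop helper. The idea is that the ownership test lives in the query itself, so a tweaked URL finds nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table loops ("
    "loop_number integer primary key, "
    "golfer integer not null, "
    "club text not null)"
)
conn.executemany("insert into loops values (?, ?, ?)", [(99, 1, "snt"), (100, 2, "snt")])


def fetch_loop(conn, loop_number, current_golfer):
    """Return the loop only if it belongs to the logged-in golfer, else None."""
    # Filtering on the owner means loop 100 is invisible to golfer 1,
    # no matter what number shows up in the URL.
    return conn.execute(
        "select loop_number, club from loops where loop_number = ? and golfer = ?",
        (loop_number, current_golfer),
    ).fetchone()


print(fetch_loop(conn, 99, 1))   # golfer 1 asking for their own loop
print(fetch_loop(conn, 100, 1))  # golfer 1 poking at somebody else's loop
```

The catch is that every query that pulls back data needs the same treatment.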

All that said, my favorite reason to go with UUIDs is that you don’t reveal that you only have like three clients on your system when you’re out doing demos, and that’s pretty dang important when you’re fundraising!

Back to the main point

Based on the table above, any loop is a unique combination of a club, a golfer, a tee time, and a number of holes of golf. For example, one row might track that Thurston Howell III is playing 18 holes at snooty-snooterton country club on April 1st, 2018.

It would be great if we could have a primary key like this:
snt-2018-04-01-th3-18

The snt part identifies the club, the 2018-04-01 part identifies the tee time, the th3 identifies the golfer, and the 18 part identifies the number of holes.

This makes for vastly easier to understand data! And it isn’t hard to do this. Just add a trigger on your table that fires before insert and update that sets your primary key column:
create or replace function set_loop_pk ()
returns trigger
as
$$
begin
    NEW.loop_pk = NEW.club_short_name
        || '-'
        || to_char(NEW.tee_time, 'yyyy-mm-dd')
        || '-'
        || NEW.golfer_initials
        || '-'
        || NEW.number_holes;
    return NEW;
end;
$$ language plpgsql;

create trigger set_loop_pk
before insert or update
on loops
for each row
execute procedure set_loop_pk();

Of course I replaced columns like golfer with golfer_initials, but hopefully that didn’t trip you up.

Also, the code above assumes you’re using the postgresql database, but you can translate it into whatever other database environment you want.

If you’re some kind of crazy person, I suppose you could even build that PK in your ORM layer.
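For what it's worth, the app-layer version is only a few lines. This is a sketch in Python, not a recommendation. The function name build_loop_pk is made up, and it just mirrors the string the trigger above builds.

```python
from datetime import date


def build_loop_pk(club_short_name, tee_time, golfer_initials, number_holes):
    """Join the four parts with dashes, in the same order as the trigger."""
    return "-".join(
        [club_short_name, tee_time.isoformat(), golfer_initials, str(number_holes)]
    )


print(build_loop_pk("snt", date(2018, 4, 1), "th3", 18))
```

With the example values from this post, that prints snt-2018-04-01-th3-18.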

Why is this better? This is better because anyone that sees a meaningful PK can infer a lot about the inner data. This is one of those things that will save you tons of time in a crisis because you don’t need to write lots of joins to understand who the heck is user ‘ac29a573-35f2-4200-b10a-384999426ee6’ or which club has club_id 876.

Your end-users will be more confident in the system as well. Labels printed with a meaningful PK are self-evident. URLs hint about the contents.

Sure, there are times when you need obfuscation, but it is easy to have a meaningful PK and then scramble it somehow, with a real cryptographic solution, rather than leaning on a UUID.

Or really, the best approach in my opinion is to keep your meaningful PK, but also tag on another parameter that combines that meaningful PK with a secret value and then hashes it. People call this approach an HMAC.
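Here is a minimal sketch of that idea using Python's standard hmac and hashlib modules. The secret value, the sig query parameter, and the function names are all assumptions for illustration; a real system would load the secret from somewhere safer than the source file.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical secret; keep the real one out of source control


def sign(pk):
    """Return a hex HMAC-SHA256 tag for a primary key."""
    return hmac.new(SECRET, pk.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid(pk, tag):
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(sign(pk), tag)


pk = "snt-2018-04-01-th3-18"
tag = sign(pk)
url = f"https://example.com/loop/{pk}?sig={tag}"

print(is_valid(pk, tag))                       # the URL we handed out
print(is_valid("snt-2018-04-02-th3-18", tag))  # same tag, tweaked PK
```

The PK stays readable, but somebody who edits it can't produce a matching tag without the secret.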

Last point: if your data is only unique because you’re using a sequence or because you’re using a UUID, well, you’re not really “doing databases right”. Using a meaningful PK means you’ve figured out enough about what you’re storing to know what makes a row unique.

Product review: todoist

Posted on May 1, 2017 by matt

I’ve been using Todoist for a few months. It’s not bad!

What I like

Since there’s a mobile app and a web interface, it is really likely I’ll get stuff stored in there.
Nearly no extra data is required to store a task. I can just put “bananas” in a new task’s title and hit save. Again, this increases the likelihood I’m actually going to use the product. If a bunch of fields were required and didn’t have defaults, I might put off using the app.
The alexa interface is fun. I can say “alexa add to my todo list …” and then add a line. I can also have alexa read back my to-do list.
I like how projects and tasks can both be nested. I like how a task only belongs to one project, but can have many labels on it. And I like the priority feature.

What isn’t perfect

There’s no obvious way to track the estimated size / difficulty / required work for a task. In other words, I can’t mark a task as “easy” or “really tricky” and then rank by that.
Linking to tasks isn’t fun or easy. Links look like this:
```
https://todoist.com/showTask?id=2179109422
```
I found that link buried behind two mouse clicks. Meanwhile, github issues start at #1 in each project and increment up from there. That is so much nicer! I can easily tell somebody “hey look up task XYZ-432” but I can’t remember ten digits!
A task has a title, but I want another field where I can add more description of the task. For example, sometimes I want to add links to screenshots, or blog posts with discussions, etc.
Tasks need more statuses, like “in progress” and “will not do this”. Right now, as far as I can tell, a task is either not finished or finished or deleted. I need more statuses!
There’s not an easy way to put tasks in order relative to each other. It is possible to set priority levels on tasks, but if three tasks are all at the same priority level, it isn’t easy to put them in a particular order.
This is kind of complex, and expects a lot from a single application, but there are a lot of times that I want to store stuff related to a project that isn’t a to-do entry. For example, say I have a conversation with a client. We probably talked about a bunch of things:
- near-term to-do items
- stuff that would be nice, but not immediately planned
- background information about the project
The last point doesn’t fit that well into the todoist model!
You can’t (as far as I can tell) upload attachments to tasks. Update! You can, but you have to add them as comments!
There’s a developer API, but not an official CLI program. Instead, there’s a bunch of half-finished CLI programs on github.

The Pareto principle (why some bugs are OK to ignore)

Posted on March 22, 2017 by matt

There’s this thing called the Pareto Principle, which says:

roughly 80% of the effects come from 20% of the causes

You can quibble about the specific number values. Maybe 80 and 20 aren’t exactly right. But as long as you have customers that aren’t perfectly evenly distributed across bugs, you should consider that maybe some of your bugs aren’t worth fixing.

Here’s a contrived example: Imagine you got a product XYZ, and you got 100 users. They’re all mad because of five bugs (bug A through bug E).

80 of your users are mad because of bug A. (80% of 100)
16 other users are mad because of bug B (80% of the remaining 20)
3 users hate bug C (roughly 80% of the remaining 4 users)
User #100 filed two bug reports: D and E. He won’t be happy until both are resolved.

If you add up 80 + 16 + 3, you’re at 99 users. In other words, if you fix 3 out of 5 bugs, 99% of your customer base would be satisfied.

However, making that last customer happy is probably not worth it! You can satisfy 99% of your market by doing 60% of the required work.
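If you want to play with the numbers, here is the same arithmetic as a tiny Python sketch. It assumes, like the 60% figure above does, that every bug takes the same amount of work to fix.

```python
# Users made happy by fixing each bug, from the example above.
# User #100 needs both D and E, so those two are left off the list.
users_per_bug = {"A": 80, "B": 16, "C": 3}
total_users = 100
total_bugs = 5

satisfied = 0
for fixed, (bug, users) in enumerate(users_per_bug.items(), start=1):
    satisfied += users
    work = fixed / total_bugs
    happy = satisfied / total_users
    print(f"after fixing bug {bug}: {work:.0%} of the work, {happy:.0%} of users happy")
```

The last line it prints is the 60% of the work for 99% of the users from the example.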

Stop offering janky fixes

Posted on December 11, 2016 by matt

When doctors show up to work, they take time to wash hands thoroughly even if there are queued-up patients in critical status.

Meanwhile, us programmers deal with production bugs in the most expedient way possible. And usually that involves some janky fix and a comment like this:
# TODO: this won't work forever
and then we’re on to the next crisis.

We have to get better about this. It’s fun to play the hero, and say we can fix everything right away, but in the end, we are digging our own graves.

This post is fueled by me cleaning up a mess caused by too many janky fixes all imploding simultaneously.

Last point: don’t blame your bosses and their unreasonable demands. Don’t expect them to understand the PROs and CONs. Simply do not offer any solution that makes the problem worse. We are the experts!

Going back to the doctor example, I’m sure the desperate patient would love to rush the doctor, because sure, 9 out of 10 times, their hands are probably clean enough, and if an infection does start, well, that’s what antibiotics are for.

But part of the reason why doctors are so revered and so well compensated is because they insist on being treated a certain way.

Ask a doctor for a “good enough” solution, or maybe ask how much would it cost if they don’t do it “the absolutely perfect” way, or any of the other lines your middle managers and sales people hit you with when trying to whittle down your estimate.

Doctors will just stare at you like you’re an idiot. That’s what we need to start doing.

Postgresql: convert a string to a date or NULL

Posted on May 30, 2016 by matt

We’re working with some user-submitted text that we need to convert into dates. Most of the data looks correct, but some of it looks glitchy:

expiration_date
8/16
5/17
damaged
6/16

See that line with “damaged” in there? That will cause to_date to throw an error:
select to_date('damaged', 'YY/MM');
ERROR: invalid value "da" for "YY"
DETAIL: Value must be an integer.

So I wrote this function:
create or replace function dt_or_null (s text, fmt text)
returns date
as
$$
begin
    return to_date(s, fmt);
exception
    when others then return null;
end;
$$ language plpgsql;
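If you would rather do the same cleanup in app code, here is a rough Python equivalent. The name dt_or_none is made up, and the "%m/%y" format is my assumption about how to read values like 8/16 (month, then two-digit year). Unlike the "when others" handler above, it only swallows the errors that strptime raises for bad input.

```python
from datetime import datetime


def dt_or_none(s, fmt):
    """Return the parsed date, or None when s does not match fmt."""
    try:
        return datetime.strptime(s, fmt).date()
    except (ValueError, TypeError):
        # ValueError: the text doesn't match the format.
        # TypeError: s isn't a string at all (for example, None).
        return None


values = ["8/16", "5/17", "damaged", "6/16"]
parsed = [dt_or_none(v, "%m/%y") for v in values]
print(parsed)
```

The "damaged" entry comes back as None and the rest come back as dates on the first of the month.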

My advice to new programmers looking to start their career

Posted on May 28, 2016 by matt

Your resume is probably pretty good, but you need to show you can build stuff beyond school assignments. You don’t need a job to do that though! Here’s my advice:

Prove that you can build and maintain something without being supervised. Build some kind of web project in your free time and host it online on AWS or rackspace or my favorite, Linode. That link has my referral code in it, by the way 🙂

Start with something as easy as possible. Don’t worry though — you will discover a ton of difficulties as you work through it. Your project can be anything:
- a really simple recipe database
- the most popular men’s socks on Amazon
- weather forecast for nearby cities
At the bottom of every screen in that project, add a link to your github profile and your linkedin page, and put your email in there and say something like “I’m looking for work!”

Once you’re done, pick a new project. Maybe rewrite the same thing in a different language. The point here is to make real things that regular people can interact with.

Silly projects are likely to get more attention. For example, the KJV Programming tumblr site is hugely popular and doesn’t really do anything useful for anyone.
Get involved with some volunteer programming work. In Cleveland, there are several groups of programmers that volunteer their time. Look at Cleveland Givecamp, for example, or Open Cleveland.

Wherever you are, I bet there’s a group like this already. If not, start one!

Or, just find an organization like a church or a club or a business that you like and offer to work with them to do something like set up a better website, automate some financial reports, or even just help them manage their facebook / instagram / twitter accounts.

You will learn how to work with non-technical people this way. That is an important skill!
Start a blog.

Write tutorials for little things you figure out while building your projects. Write tutorials for stuff that you are learning in school, like recursion, or operator overloading in C++, or why you hate or love one language vs another.

Write about the nonprofits or clubs or small businesses you’re working with.

Practice writing clearly and succinctly.

Read William Strunk’s The Elements of Style at least three times. It’s nearly a hundred years old and still the best writing guide out there.

Publish what you do on twitter and reddit and hacker news and other places so you get more attention. Don’t waste a minute arguing with the haters though. Nobody cares about them.

Add google analytics to your blog and study what posts attract the most attention.
Go to as many technical meetups as you can and introduce yourself to people and tell them you are looking for work. Talk about what you are working on. Ask them where they work and if they like it and if they know of openings.

If you’re anywhere near Columbus, Ohio, show up at PyOhio on July 30th and 31st and introduce yourself to as many people as you can. Maybe even do a 5-minute lightning talk on one of your projects — the sillier the project is, the better.
Cold-call recruiters at companies like Robert Half, Oxford, Randstad, etc and tell them you’re looking for work. Ask them what skills are the most sought after.

Learn those skills, and build projects with them, and then write about it.

The point with all this stuff is to make yourself a programming celebrity. You don’t want to go looking for jobs — you want jobs to come to you.

Good luck on your quest!

Consider that you are lucky to live at a time where a few of us have vastly more upward economic mobility than ever before. It just takes effort.

Are you an animal or a human?

You should show up at the first Heights Code Hop

Posted on February 1, 2016 by matt

A few of us are organizing a one-day meetup at the Cleveland Heights Library on Saturday, April 30th, 2016.

We’ll talk about open-source web technology.

We need talk submissions and we need people to show up!

The website is here: heights-code-hop.org.

Please help spread the word!

What’s good and bad about github issues

Posted on January 2, 2014 by matt

Ticketing / workflow / bugtracker systems are always nasty. Github’s is pretty good. Maybe the best of what’s out there. But it ain’t perfect.

Here’s what I like:

It’s ready to go immediately once you start your github repo.
You can link a commit to an issue by mentioning the issue number in the commit.
Labels let you store a TON of metadata.

And what I dislike:

No obvious way to tell if somebody is actively working on an issue. More generally, no “status” field exists on an issue.
No obvious way to do a query like “label X or label Y”.
No command-line interface.
Since github doesn’t include a built-in mailing list, github issues often get used for support requests. Then when somebody explains “here’s how to do … “, the issue gets closed, and that helpful expensive-to-write documentation is hidden away. The solution here is for github to host a mailing list for every repository.

t+1

Programming, gardening, economics, life in Cleveland Heights
