Best Practices For Programming, Part 1
Categories: programming. Originally posted: 2024-09-18 15:23:00
Over my decades of programming I've invented a couple of best practices recommendations. I'll describe two of them in this post.
I haven't seen either of these independently invented by anybody else, much less discovered that they were long-standing wisdom. I've given lightning talks on the first one at Python conventions once or twice; the best feedback I got was from Raymond Hettinger, who if I recall correctly said "I think you're wrong but I don't have any counter-examples." He's never followed up and provided counter-examples, which is because there aren't any.
Eschew The Extraneous Else
This best practice is applicable to any structured programming language with rudimentary flow control structures—in other words, any language used for new projects in the last forty years. I'll show it to you using C syntax and Python syntax but it's applicable in any modern language.
Consider the following pseudo-code:
| 
/* in C */
if (something) {
    do_something();
    return value;
} else {
    do_something_else();
}
 | 
# in Python
if something:
    do_something()
    return value
else:
    do_something_else()
 | 
We have an if statement, with a then clause and an else clause. The then clause ends with an unconditional return statement.
I apply the following rule:
If you have an if-statement with a then clause, and the then clause ends with an unconditional return statement, you don't need an else. If you have one, remove it, and transform the code in the else clause into code simply following the if statement. The code will always be improved.
Here's our pseudo-code with this transformation applied:
| 
/* in C */
if (something) {
    do_something();
    return value;
}
do_something_else();
 | 
# in Python
if something:
    do_something()
    return value
do_something_else()
 | 
I claim this is an improvement—and in fact it's always an improvement. It helps the return statement stand out a little more, and makes it clear to the user that the "else" code is the default behavior rather than exception-handling behavior.
Also: it's better simply because it's simpler. An else clause is unnecessary here, and using it doesn't enhance readability. There's no reason to have it. And why have unnecessary stuff in your programs?
There's a corollary to this transformation: if you have an if statement with an unconditional return in its else clause, and the then clause does not end with an unconditional return statement, you should negate the conditional for the if statement, swap the then and else clauses, and apply this transformation.
Here's an example. First, the before:
| 
/* in C */
if (something) {
    do_something();
} else {
    do_something_else();
    return value;
}
 | 
# in Python
if something:
    do_something()
else:
    do_something_else()
    return value
 | 
Now we apply the transformation. Here's the after:
| 
/* in C */
if (!something) {
    do_something_else();
    return value;
}
do_something();
 | 
# in Python
if not something:
    do_something_else()
    return value
do_something()
 | 
Again, this is clearly better. I think having an unconditional return in the else clause but not in the then clause is misleading and hard to follow. if statements handle exceptional behavior; if you return from the function when the boolean conditional has one value but not the other, that return statement should be in the then clause and you should definitely apply this transformation.
What if both the then and else clauses end with unconditional return statements? I still perform this transformation, but I might also swap the then and else clauses (and negate the expression). Generally I want the "exception to the rule" code inside the if statement, and the "general case" code outside. If neither is clearly the "exception to the rule"—if both clauses are equally likely—I'll make the then clause be the shorter of the two.
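Here's a minimal Python sketch of the case where both branches return. The function name describe_order and its parameters are hypothetical, made up purely for illustration. The short "exception to the rule" case goes inside the if, and the general case follows it with no else:

```python
def describe_order(quantity, unit_price):
    """Return a one-line summary of an order (hypothetical example)."""
    # Exceptional case: short, so it lives inside the if statement.
    if quantity <= 0:
        return "empty order"
    # General case: no else needed, it simply follows the if statement.
    total = quantity * unit_price
    return f"{quantity} item(s), total {total:.2f}"


summary_empty = describe_order(0, 5.0)
summary_full = describe_order(2, 1.5)
```

With an else wrapped around the last two lines the function would return exactly the same values; the only difference is how it reads.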
Note that in any modern language, all the before and after examples above will compile into the same runtime code anyway. Collapsing this kind of control flow is an easy optimization, and modern compilers and interpreters all support that and more. So these transformations won't have any runtime effect: they won't make your code any faster or slower. They're just for readability.
Realtime Input Validation For Python Generator Functions
This best-practice recommendation is more involved, and it only applies to Python.
Consider the following pseudo-code:
    def fn(a, b, c):
        if not a:
            raise ValueError('a must be true')
        if not isinstance(b, str):
            raise ValueError('b must be a str')
        x = a * b
        for value in c:
            yield (x, value)
The first thing we can say about this function
is that it's a generator.  Any function
containing yield or yield from
is a generator.  When you call fn, the
value returned to you will be an iterator, which
you can use anywhere you could use any other iterator.
Generator functions behave in some surprising ways. One wrinkle many folks aren't aware of: when you call a generator function, the code in the function isn't run yet. If I call
    i = fn(x, y, z)
Python creates the iterator, and keeps references
to the arguments, but does not run the body of fn yet.
The body won't start running until the first time you
iterate over that iterator, via either a for
statement or by calling next on the iterator.
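You can watch this happen with a small experiment. This is only a sketch: the calls list is a hypothetical addition used to record when the body starts running, and the validation code is left out to keep it short.

```python
calls = []  # hypothetical log, added only to observe when the body runs


def fn(a, b, c):
    calls.append("body started")  # first statement of the generator body
    x = a * b
    for value in c:
        yield (x, value)


i = fn(2, "ab", [1, 2])
log_after_call = list(calls)  # snapshot taken right after calling fn

first = next(i)               # the body only starts running here
log_after_next = list(calls)  # snapshot taken after the first iteration
```

The first snapshot is empty and the second contains one entry, so nothing inside fn ran until next was called.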
Now consider: the first paragraph of code in fn is data validation. We check that the values for a and b are valid before we proceed with processing. But again—this code isn't run until you iterate. When you execute
    i = fn(x, y, z)
fn hasn't been run, which means the inputs
haven't been validated yet.
This can be a problem when you create iterators for later use. Let's say you call fn fifty times, and stick all fifty resulting iterators into a list. Then you go to sleep. You wake up two weeks later because now it's time to process those fifty iterators. You start iterating over them, and, oops! The seventeenth iterator blows up because its inputs were wrong. You get a stack trace, but the stack trace shows you where you iterated over the iterator, not where the inputs were passed in. This makes debugging a royal pain. Who called fn with those bad inputs? It can be hard to match up the creation call site with the bad iterator.
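A scaled-down sketch of that scenario, using two iterators instead of fifty. The names iterators and errors are hypothetical bookkeeping added for illustration; fn is the function from the example above.

```python
def fn(a, b, c):
    if not a:
        raise ValueError('a must be true')
    if not isinstance(b, str):
        raise ValueError('b must be a str')
    x = a * b
    for value in c:
        yield (x, value)


# Both calls succeed, even though the second passes a bad value for b.
iterators = [fn(1, "ok", [1]), fn(1, 42, [1])]

# Much later: the bad input is only detected once we iterate.
errors = []
for index, iterator in enumerate(iterators):
    try:
        list(iterator)
    except ValueError as exc:
        errors.append((index, str(exc)))
```

The exception surfaces inside the loop at the bottom, far away from the line that built the list, which is exactly the call site you'd want the traceback to point at.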
I propose you make the following transformation to your code. First, I'll show you the transformed version of the above code sample, then I'll walk you through the changes step-by-step.
The transformed code:
    def fn(x, c):
        for value in c:
            yield (x, value)
    _fn = fn
    def fn(a, b, c):
        if not a:
            raise ValueError('a must be true')
        if not isinstance(b, str):
            raise ValueError('b must be a str')
        x = a * b
        return _fn(x, c)
This code behaves nearly identically to the original.
When you call fn you get back an iterator.
But we've transformed the code a little; now there
are two functions with the same name, though only the
top one is a generator. There's also an alias in the
middle.
This version is way better because the data validation is now run immediately when you call fn. Suppose you execute:
    i = fn(x, y, z)
If one of the inputs is bad, you'll get a traceback
immediately, at the time the iterator is
constructed!  (Instead of later, at the time
of first iteration.)
This makes debugging so much easier!
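Here's a runnable sketch that exercises the transformed code. The two function definitions and the alias are the same as above; the variable names raised_at_creation and good are hypothetical and only there to capture what happens.

```python
def fn(x, c):
    for value in c:
        yield (x, value)


_fn = fn  # alias for the generator, taken before fn is redefined


def fn(a, b, c):
    if not a:
        raise ValueError('a must be true')
    if not isinstance(b, str):
        raise ValueError('b must be a str')
    x = a * b
    return _fn(x, c)


# Bad input: the exception is raised by the call itself, before any iteration.
try:
    fn(1, 42, [1])
    raised_at_creation = False
except ValueError:
    raised_at_creation = True

# Good input: iteration behaves just like the original generator.
good = list(fn(2, "ab", [1, 2]))
```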
Here's the list of steps involved, along with explanations for why you should do it specifically this way.
- Ensure that your function has two distinct phases of operation: first a "data validation" block, followed at some point by a "code executing yield" block.
The "data validation" block shouldn't contain any yield or yield from statements (or expressions). The "code executing yield" block shouldn't perform any data validation. 
- Create a new function above your generator with the same name.
It should have the same name so that, when users see it in a traceback, they recognize it by name. And it has to be above the original entry point because the second instance will, by design, overwrite the first instance in the relevant symbol table. I'll refer to these two functions as the upper and lower functions respectively. 
- Add a statement between the two functions, creating an alias for the upper function.
The alias can be named anything you want, but it should start with an underscore (_). Symbols starting with underscores are a convention in Python suggesting they're implementation details and shouldn't be interacted with directly. (Python also has some support for name mangling identifiers in classes, but that's only applied to identifiers with two or more leading underscores.) I recommend using the original function name prepended with a single underscore as the name for the alias, as I've done above. 
- Move the "code executing yield" code from the lower function to the upper function.
The "data validation" code must stay in the lower function.
This changes the two functions; now the upper function is a generator, and the lower function is not a generator. 
- Add a return statement to the end of the lower function, calling the upper function using the alias and returning the result.
Again, you have to use the alias here because the definition of the lower function overwrote the upper function's entry in the symbol table. As needed, define parameters to the upper function, and pass in matching values as arguments in the call to the alias. The signature of the upper function is a private detail that won't affect the user; it doesn't need to match the original function, and you can modify it at any time as your needs require. If it's easy, go ahead and make the signature of the upper function match the signature of the lower function; this will make debugging even easier on the user. But this isn't necessary, only a nice-to-have, and if it's inconvenient don't worry about it.
- This approach also means you can use fn as a data validator. Simply call it and pass in the values you want validated. It'll check them over, and if they're okay it'll return an iterator—which you can just throw away! 
- What about code that's neither "data validation" nor "code executing yield"? You may have code between the two blocks transforming the inputs and preparing data you'll use during iteration, but without actually calling yield yet. Should that go in the upper or lower function? I'd say there's no hard-and-fast rule here. Code that might throw an exception should probably go in the lower function, so you can early-detect the problem to make debugging easier. (Though if it can throw an exception... that sounds like data validation code to me.) If the code isn't likely to throw an exception, it might make sense to move it to the upper function, as that makes it "lazily evaluated". And if you use your function as a data validator as I suggested above, you don't care about these transformations; delaying them until the upper function is run means "you don't pay for what you don't use", which is always a good policy.
- If your function calls other generator functions, make sure you call those other generator functions from the lower function. When those other generator functions use this transformation to perform their data validation early, all data validation will be performed early! If you wait to call those other generator functions until the upper function, they won't get to perform their data validation until then, and we're back to hard-to-debug data validation problems. 
- Since the transformed version of the generator function behaves essentially identically to the original, you can safely perform this transformation on existing code. If the inputs are valid, the code behaves identically, and no user call sites will need to change. The behavior only changes when the inputs are wrong and you need to throw an exception; obviously, the exception will be thrown a lot earlier. It's up to you, but I think your users may thank you for making this change! 
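To illustrate the point about calling other generator functions from the lower function, here's a sketch with two generator functions, both written in the transformed style. The names pairs and numbered are hypothetical, invented for this example. numbered calls pairs from its lower function, so the validation in pairs runs as soon as numbered is called.

```python
def pairs(x, c):
    for value in c:
        yield (x, value)


_pairs = pairs


def pairs(a, b, c):
    if not a:
        raise ValueError('a must be true')
    if not isinstance(b, str):
        raise ValueError('b must be a str')
    return _pairs(a * b, c)


def numbered(inner):
    for n, item in enumerate(inner):
        yield (n, item)


_numbered = numbered


def numbered(a, b, c):
    # Call the other generator function here, in the lower function,
    # so its data validation runs right away.
    inner = pairs(a, b, c)
    return _numbered(inner)


try:
    numbered(0, "x", ["p"])
    failed_early = False
except ValueError:
    failed_early = True

result = list(numbered(1, "x", ["p", "q"]))
```

If the call to pairs were moved into the upper function of numbered, the ValueError for the bad input wouldn't appear until the first iteration.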
