[geekery, databases] Follow the arrows
Aug. 19th, 2005 10:23 pm
This collection of notes is half a reminder to myself of how I undertake a process like this, and half for anyone who's curious about one possible approach for how to grok a very large codebase fairly quickly.
Step 1: Define the problem. In this case, I want to extend the syntax of PostgreSQL to incorporate WHERE clauses of the form

WHERE EXAMPLE KEY foo LIKE (list-of-keys) NOT LIKE (list-of-other-keys)

and ORDER BY clauses of the form

ORDER BY EXAMPLE KEY foo {ASC | DESC} (list-of-keys) (list-of-more-keys) (list-of-still-more-keys) .... (final-list-of-keys)

Such clauses, when executed, will train a binary classifier (for WHERE clauses) or a ranking classifier (for ORDER BY clauses) on those lists of keys, then use that classifier to perform binary filtering or ordered ranking.
I've already written my binary and ranking classifier code. I already know how to alter the grammar to include these types of clauses. For my next trick, I have to figure out how to make that new grammar actually do something.
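To keep the goal concrete, here is a purely hypothetical sketch (in C, since that's what Postgres is written in) of the data the new clause has to carry through the parser -- none of these names exist in PostgreSQL; they're just placeholders for the three pieces the grammar will have to capture:

    /* Hypothetical sketch only -- nothing here exists in PostgreSQL.
     * Whatever node the grammar builds for an EXAMPLE KEY clause has to
     * carry three things: the key column, the positive example keys, and
     * the negative example keys. */
    typedef struct ExampleClauseSketch
    {
        char  *keyColumn;        /* the column named after EXAMPLE KEY */
        char **likeKeys;         /* keys the classifier should treat as +1 */
        int    numLikeKeys;
        char **notLikeKeys;      /* keys the classifier should treat as -1 */
        int    numNotLikeKeys;
    } ExampleClauseSketch;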
Step 2: Everything has to start somewhere. In a C program, that's main(). But just reading main() is hard; in Postgres, main.c is 335 lines by itself, most of which are OS- and architecture-specific #defines, and that doesn't tell you very much. However, I already know that I'm going to modify the syntax, and by extension I'll also have to modify the semantic actions attached to that syntax -- in other words, I have to modify other parts of the codebase to hook into the classifier code I've written. So this gives me two starting points: main(), and the grammar, which is used to generate a bison parser. Every bison-generated parser exposes a function called yyparse(), which has to be called from somewhere else, and that somewhere is my second starting point.
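For anyone who hasn't poked at bison before, the calling convention it generates looks roughly like this -- a sketch of the interface shape only, not PostgreSQL's actual raw_parser(); the scanner setup and the result variable are stand-ins:

    /* Sketch of the bison calling convention -- not PostgreSQL's real code.
     * bison generates yyparse() from the grammar file; it returns 0 on
     * success and leaves whatever the grammar's semantic actions built
     * wherever those actions chose to put it. */
    extern int yyparse(void);                  /* generated by bison */

    extern void setup_scanner(const char *s);  /* stand-in for lexer setup */
    extern void *parse_result;                 /* filled in by the actions */

    void *
    parse_one_query(const char *query_string)
    {
        setup_scanner(query_string);
        if (yyparse() != 0)
            return NULL;                       /* syntax error */
        return parse_result;
    }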
Thus, my mission now is to fill in the holes between yyparse() and main(). Judicious use of grep -r turns up the following, simplified a bit for easy readability:
yyparse() is called by raw_parser() in src/backend/parser/parser.c.
raw_parser() is called by pg_parse_query() in src/backend/tcop/postgres.c.
pg_parse_query() is called by exec_simple_query() in src/backend/tcop/postgres.c.
exec_simple_query() is called by PostgresMain() in src/backend/tcop/postgres.c.
PostgresMain() is called by main() in src/backend/main/main.c.
It took some trial and error to find this, of course; several of these functions are called from more than one place, and in at least one case there is more than one route back to main() (pg_parse_query() is useful in quite a few places, it would seem), but this appears to be the shortest one.
Of course, it always helps when functions are named meaningfully. exec_simple_query looks promising, and it appears to be the intermediary step between the query parser and the backend-specific main(). Let's look at that.
(Fair warning: I'm actually skipping a step here, but it's kind of an unfair one to assume for anyone else. When I was working on Dejector, I worked with the bison grammar enough to remember that the semantic action attached to an ORDER BY clause creates a SortBy node (and I wouldn't expect anyone who hadn't spent three weeks buried in a ten-thousand-line grammar to know that at all). A little digging revealed that a SortBy node is just a Node struct with one of its values set to the string "SortBy". Those nodes also turn up in usefully named functions like transformSortClause(), and in fact there's also a struct called SortClause, but grepping around in this section of the code wasn't very informative at first -- probably because I was messing around in the middle of the path that leads from query to instructions, and wasn't sure which way was forward and which way was backward. So we'll see those Nodes and Clauses again, but my notes are going to follow the path that actually worked.)
Step 3: Read and translate. exec_simple_query() is a fairly large function, mostly because of all the memory-management bookkeeping that a large, long-running server needs to take care of. It's six pages printed out, which is a lot of stuff to keep track of. So I abridged it into a few lines of pseudocode -- not the Python-like pseudocode that I often write when I'm drafting an algorithm, but plain English so that I can remember what I'm doing. It looks like this:
1. Do some housekeeping.
2. Start up a new transaction.
3. Parse the query, drop the results into parsetree_list.
4. For each Cell in the List: // each ListCell is cast to a Node
   4.1. call CreateCommandTag() // each command will be some type of Statement
   4.2. BeginCommand() // does NOTHING WHATSOEVER!
   4.3. Test to see if we're in an aborted transaction
   4.4. Analyze query
   4.5. Plan query
   4.6. Create a portal to run the query in. (What's a portal?)
   4.7. Create a receiver // destination stream?
   4.8. Run the portal
   4.9. Drop the portal
   4.10. If this is the last Stmt of the query, shut down the transaction.
   4.11. EndCommand()
5. Make sure transaction is closed.
6. Do some time calculations and more housekeeping.
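To make that skeleton a little more concrete, here's a toy rendering of the same control flow -- stand-in types and stub calls only, nothing copied from the real tree; the comments map each step back to the numbered outline above:

    /* Toy illustration of the control flow above -- not PostgreSQL code.
     * The real work happens in pg_parse_query(), pg_analyze_and_rewrite(),
     * pg_plan_queries(), and the portal machinery. */
    typedef struct Stmt { int tag; } Stmt;
    typedef struct StmtList { Stmt **items; int count; } StmtList;

    extern StmtList *parse_query(const char *query_string);   /* step 3 */
    extern void     *analyze_stmt(Stmt *stmt);                /* step 4.4 */
    extern void     *plan_stmt(void *querytree);              /* step 4.5 */
    extern void      run_plan(void *plantree);                /* steps 4.6-4.9 */

    void
    run_simple_query(const char *query_string)
    {
        StmtList *stmts = parse_query(query_string);

        for (int i = 0; i < stmts->count; i++)                /* step 4 */
        {
            void *querytree = analyze_stmt(stmts->items[i]);
            void *plantree = plan_stmt(querytree);
            run_plan(plantree);
        }
    }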
Step 4: Lather, rinse, repeat. 4.4 is how I translated a call to pg_analyze_and_rewrite(), and 4.5 corresponds to pg_plan_queries(), both of which are also in postgres.c. Figuring out what those do is, once again, a matter of reading through one line at a time and translating, grepping through the codebase as necessary to find out where those functions' sub-functions are defined and what those do. (At the moment I'm poking around in analyze.c, learning exactly how a raw parse tree is turned into a Query.) So far I've discovered, from the grammar and the CreateCommandTag() function, that there are several dozen types of Statements. Since my new WHERE and ORDER BY clauses only make sense in SELECT statements, that's all I care about, which helps me narrow down my hunt even more.
Every once in a while I also find myself grepping for some typename that shows up unannounced. This is C, so typically something like Query or List or NodeTag is a struct (or, less often, an enum; in one case, a union). If it were C++, my search would be a lot more fragmented, since structs and classes can have methods attached, but in this case I'm usually ending up in some header file, finding out what kinds of data and pointers live in a particular struct. 'grep -r "} name" *' has been especially useful in these cases. Yay typedefs.
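In case that grep pattern looks cryptic: the headers lean heavily on the classic typedef-struct idiom, so the closing brace of a definition is immediately followed by the type name, which is exactly what grep -r "} Query" (say) latches onto. A generic example of the shape -- not copied from any Postgres header:

    /* The typedef idiom that makes grep -r "} Name" so effective.
     * Generic illustration only, not an actual PostgreSQL header. */
    typedef enum SomeTag
    {
        TAG_FOO,
        TAG_BAR
    } SomeTag;

    typedef struct SomeNode
    {
        SomeTag tag;                /* which kind of node this is */
        int     value;
        struct SomeNode *next;
    } SomeNode;                     /* <-- the "} SomeNode" that grep finds */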
parse_analyze() calls do_parse_analyze(), which calls transformStmt() -- a function which takes a parse tree and returns a Query. It's one giant switch(), and all we care about is the T_SelectStmt case. This case calls another helper function, transformSelectStmt(), which creates, populates and returns that Query, using a bunch of other helper functions. One of those is transformWhereClause(), and one of them is transformSortClause(). Better and better!
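That giant switch() is the same tag-dispatch pattern that shows up all over the backend, so it's worth internalizing. Here's the generic shape, with made-up tags rather than the real T_* values:

    /* Generic tag-dispatch shape, like transformStmt()'s big switch --
     * toy tags only; the real code switches on the parse tree's NodeTag
     * and handles a few dozen statement kinds. */
    typedef enum { TOY_SELECT_STMT, TOY_INSERT_STMT, TOY_OTHER_STMT } ToyTag;
    typedef struct { ToyTag tag; } ToyNode;

    extern void *transform_select(ToyNode *stmt);   /* stand-in for
                                                       transformSelectStmt() */

    void *
    transform_stmt(ToyNode *parseTree)
    {
        switch (parseTree->tag)
        {
            case TOY_SELECT_STMT:
                return transform_select(parseTree);  /* the case I care about */
            default:
                return NULL;                         /* everything else */
        }
    }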
Let's look at WHERE clauses first. transformWhereClause() calls two functions, transformExpr() and coerce_to_boolean() (the latter makes sense, since WHERE clauses are for boolean filtering). transformExpr() lives in parse_expr.c and coerce_to_boolean() in parse_coerce.c, at which point I fall to my knees and sing the praises of the kindly gods of refactoring. Here's where the magic really starts: type checking, casting of elements of the query string into the actual data types the optimizer and the executor will perform computations on, and -- most helpful for me -- a switch() on the NodeTag of the where clause that points me to exactly the code block I care about. But wait: what's the NodeTag of the where clause? Checking back with the grammar, I see the rule where_clause: WHERE a_expr { $$ = $2; }. Oh: it's whatever the a_expr rule built, and indeed there is a T_A_Expr case. Of course, that's the complicated one: a big switch() on the kind data member of the A_Expr node that our generic Node gets cast to. Fortunately, I have never been smart enough to let complicated things stop me.
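Reconstructed from that reading, the whole of transformWhereClause() boils down to a two-step shape: transform the raw expression, then force it to be boolean. Here's a paraphrase with stand-in types -- the real signatures and the ParseState plumbing are more involved:

    /* Paraphrase of the two-step shape described above -- stand-in types,
     * not the real PostgreSQL signatures. */
    typedef struct ExprNode ExprNode;                 /* stand-in for Node */

    extern ExprNode *transform_expr(ExprNode *raw);   /* cf. transformExpr() */
    extern ExprNode *coerce_to_bool(ExprNode *expr);  /* cf. coerce_to_boolean() */

    ExprNode *
    transform_where_clause(ExprNode *clause)
    {
        if (clause == NULL)
            return NULL;                  /* no WHERE clause at all */

        /* 1. resolve operators and types in the raw expression */
        ExprNode *qual = transform_expr(clause);

        /* 2. a WHERE qualifier must be boolean, so coerce it (or complain) */
        return coerce_to_bool(qual);
    }

All the interesting work for my purposes happens inside transformExpr()'s switch on the A_Expr kind.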
We've got several cases here: AEXPR_OP, AEXPR_AND, AEXPR_OR, and a few others. They mostly have one thing in common -- two child expressions -- which is also borne out by the grammar; it's clearest with AEXPR_OP. Those rules tend to look like a_expr OPERATOR a_expr, and the function creates child nodes called lexpr and rexpr -- "left expression" and "right expression", anyone? But I don't have two expressions -- I have three: the key field, the list of things it's like, and the list of things it's not like. So it looks like I get to add a new category of A_Expr. Perhaps I'll call it AEXPR_TOP, "A_Expr with a ternary operator". Furthermore, I'll have to create a new case in this function which will do some checks on the key column name and the lists -- I wonder if this is where I should check that the types of the items in the lists match up with the type of the key column. Probably. But I'll figure that out when I drill down low enough to actually write this code; we're still about five thousand feet above the ground right now.
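Since AEXPR_TOP is pure invention at this point, here's roughly what I have in mind -- entirely hypothetical, with placeholder names, but it shows why a binary lexpr/rexpr node doesn't fit a three-operand clause:

    /* Hypothetical sketch of the AEXPR_TOP idea -- nothing here exists in
     * PostgreSQL yet. The existing A_Expr kinds are binary (lexpr/rexpr);
     * the EXAMPLE KEY clause needs a third operand, hence a new kind. */
    typedef enum
    {
        SKETCH_AEXPR_OP,        /* lexpr OPERATOR rexpr */
        SKETCH_AEXPR_AND,       /* lexpr AND rexpr */
        SKETCH_AEXPR_OR,        /* lexpr OR rexpr */
        SKETCH_AEXPR_TOP        /* new: key LIKE (list) NOT LIKE (list) */
    } SketchExprKind;

    typedef struct SketchExpr
    {
        SketchExprKind kind;
        struct SketchExpr *lexpr;   /* key column reference */
        struct SketchExpr *rexpr;   /* list of positive example keys */
        struct SketchExpr *texpr;   /* hypothetical third child: the list
                                       of negative example keys */
    } SketchExpr;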
So, moving on to coerce_to_boolean(). Following the arrows again, that lands us in coerce_type(), which makes its decisions based on the inputTypeId and the targetTypeId of the coercion. Where are those coming from? Okay, back to coerce_to_boolean: it says inputTypeId = exprType(node). Okay, back to parse_expr.c to check out exprType(). There is no exprType() case for T_A_Expr. Lovely. That must mean that somewhere in transformExpr(), the new Node that gets returned gets some other type assigned to it. (To be fair, at least they're creating a new Node to return and not just casting through void pointers the way a lot of C programmers would. Type safety? What's that?) But since I'm planning on creating a new category of A_Exprs anyway, this means I can cheat: a quick glance at the AEXPR_AND case tells me that the returned Node is already a BoolExpr, and I'd be willing to bet that AEXPR_OPs of the = type are also BoolExprs. If I make my new case return a BoolExpr too, then the call to coerce_to_boolean() will return trivially because the expression is already boolean.
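That cheat works because the coercion is a no-op when the expression already has the target type; in generic terms the short-circuit looks like this (a toy, not PostgreSQL's actual coerce_type()):

    /* Toy version of the "already the right type" short-circuit that makes
     * the BoolExpr trick work -- not PostgreSQL's actual coerce_type(). */
    typedef struct TypedNode
    {
        int type_id;                 /* stand-in for an expression's type OID */
    } TypedNode;

    extern TypedNode *insert_cast(TypedNode *node, int target_type_id);

    TypedNode *
    coerce_to_type(TypedNode *node, int target_type_id)
    {
        if (node->type_id == target_type_id)
            return node;             /* nothing to do: e.g. a BoolExpr being
                                        coerced to boolean */
        return insert_cast(node, target_type_id);
    }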
And that's the end of transformWhereClause(). So now I have some sense of what functions I'll have to write to prepare inputs for the optimizer and the executor; I still don't know what the code I write for the executor to use my classifier code will look like, but I have a road map now.
Okay, so where are we now? We've analyzed the analyzer (*grin*); there's still a long way to go, but I could start laying down lines of code now if I felt like it. I'm not going to, because I want to have some sense of what I'm going to do to the optimizer (the answer at the moment may very well be "nothing", but I need to read through it anyway) and to the executor before I start putting down actual C. But pseudocode is definitely reasonable right now: translating what I've learned in the paragraphs above into a rough outline of what I'll need to add to those existing routines to make them handle my Evil Plans.
At this point I'm grateful to the PGDG for writing code that's well-commented and not too hard to understand, even if they haven't published an API. More as time goes on; this is the first time I've made sweeping changes to a codebase anywhere near this size, and I'll be keeping notes on how I make the changes now that I've figured out at least part of where they need to be made.
cristalia and matociquala often remark that every novel is different, and I expect every codebase is different as well -- but if these notes help you, then enjoy. :)
(no subject)
Date: 2005-08-20 08:26 am (UTC)
Though this:
foo BETWEEN (bar) AND (baz)
appears to have a ternary operator,
it can actually just be translated by the grammar into
foo >= bar AND foo <= baz
in the case of:
EXAMPLE KEY foo LIKE (list-of-keys) NOT LIKE (list-of-other-keys)
you'll actually need to process (list-of-keys) and (list-of-other-keys) at the same place and time; it wouldn't do to translate it into this:
EXAMPLE KEY foo LIKE (list-of-keys) AND EXAMPLE KEY foo NOT LIKE (list-of-other-keys)
thus necessitating an AEXPR_TOP?
(no subject)
Date: 2005-08-20 08:55 am (UTC)
Got it in one. <machine-learning-shorthand>The elements in (list-of-keys) are classified +1 and those in (list-of-other-keys) -1; the rows corresponding to each of those keys are concatenated into the x-vector of training vectors, the classes become the y-vector, and that's the training input for the machine, which, once trained, will return the appropriate class for novel inputs (and I'll have to coerce that to boolean).</machine-learning-shorthand>
(no subject)
Date: 2005-08-20 08:58 am (UTC)
Maybe at some point I'll also add the one-class problem (so that you could do something like WHERE EXAMPLE KEY foo LIKE (list-of-keys)), but not yet, and it would still hook into totally different code than the existing LIKE code.
(no subject)
Date: 2005-08-21 01:48 am (UTC)
That's correct. * friended