This is pgintcl/INTERNALS, notes on internal implementation of pgintcl.
Last updated for pgintcl-3.4.0 on 2011-09-19
The project home page is: http://sourceforge.net/projects/pgintcl/
-----------------------------------------------------------------------------

INTERNAL IMPLEMENTATION NOTES:

This information is provided for maintenance, test, and debugging.

A connection handle is just a Tcl socket channel. The application using
pgin.tcl must not read from or write to this channel.

Internal procedures, result structures, and other data are stored in a
namespace called "pgtcl".

The following namespace variables apply to all connections:
    pgtcl::debug       A debug flag, default 0 (no debugging)
    pgtcl::version     pgin.tcl version string
    pgtcl::rn          Result number counter
    pgtcl::fnoids      Function OID cache; see FAST-PATH FUNCTION CALLS
    pgtcl::errnames    Constant array of error message field names

The following arrays are indexed by connection handle, and contain data
applying only to that connection:
    pgtcl::notice()    Command to execute when receiving a Notice
    pgtcl::xstate()    Transaction state
    pgtcl::notify()    Notifications; see NOTIFICATIONS
    pgtcl::notifopt()  Notification options; see NOTIFICATIONS
    pgtcl::std_str()   For pg_escape_string etc; see ESCAPING
    pgtcl::bepid()     Backend process ID (PID)

Additional namespace variables are described in the sections below.
Result structure variables are described next.

-----------------------------------------------------------------------------

RESULT STRUCTURES:

A result structure is implemented as a variable result$N in the pgtcl
namespace, where N is an integer. (The value of N is stored in pgtcl::rn
and is incremented each time a new result structure is needed.) The result
handle is passed back to the caller as $N, just the integer. The result
structure is an array which stores all the meta-information about the
result as well as the result values.
The result structure array indexes in use are:

  Variables describing the overall result:
    result(conn)      The connection handle (the socket channel)
    result(nattr)     Number of attributes (columns)
    result(ntuple)    Number of tuples (rows)
    result(status)    PostgreSQL status code, e.g. PGRES_TUPLES_OK
    result(error)     Error message if status is PGRES_FATAL_ERROR
    result(complete)  Command completion status, e.g. "SELECT 10"
    result(error,C)   Error message field C if status is PGRES_FATAL_ERROR.
                      C is one of the codes for extended error message
                      fields.

  Variables describing the attributes (columns) in the result:
    result(attrs)     A list of the name of each attribute
    result(types)     A list of the type OID for each attribute
    result(sizes)     A list of attribute byte lengths or -1 if variable
    result(modifs)    A list of the size modifier for each attribute
    result(formats)   A list of the data format for each attribute
    result(tbloids)   A list of the table OIDs for each attribute

  Variables describing prepared query parameters in the result:
    result(nparams)     The number of prepared statement parameters
    result(paramtypes)  List of prepared statement parameter type OIDs

  Variables storing the query result values:
    result($irow,$icol)       Data value for result
    result(null,$irow,$icol)  NULL flag for result

The pg_exec and pg_exec_prepared commands create and return a new result
structure. The pg_result command retrieves information from the result
structure and also frees the result structure with the -clear option.
(Other commands, notably pg_select and pg_execute, use pg_exec, so they
also make a result structure, but it stays internal to the command and the
caller never sees it.) The result structure innards are also directly
accessed by some other routines, such as pg_select and pg_execute.

Result structure arrays are unset (freed) by pg_result -clear, and any
left-over result structures associated with a connection handle are freed
when the connection handle is closed by pg_disconnect.
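
The array layout above can be illustrated with a small stand-alone sketch.
This is not pgin.tcl code: the "demo" namespace, the variable "result1",
and the procs is_null and get_row are hypothetical names invented for the
illustration, and only a few of the indexes are filled in. A NULL is
modelled here as an empty value plus a matching null,row,col index.

```tcl
# Hypothetical stand-in for a result structure; "demo" and "result1"
# are illustrative names, not the real pgtcl namespace or handle.
namespace eval demo {
    variable result1
    array set result1 {
        nattr  2
        ntuple 2
        status PGRES_TUPLES_OK
        attrs  {id name}
    }
    # Row 0: two non-NULL values.
    set result1(0,0) 1
    set result1(0,1) alice
    # Row 1: the second column is NULL, modelled as an empty value
    # plus a null,row,col index.
    set result1(1,0) 2
    set result1(1,1) {}
    set result1(null,1,1) {}
}

# Return 1 if the value at (irow, icol) is flagged NULL, else 0.
proc demo::is_null {irow icol} {
    variable result1
    return [info exists result1(null,$irow,$icol)]
}

# Return one row as a list of column values.
proc demo::get_row {irow} {
    variable result1
    set row {}
    for {set icol 0} {$icol < $result1(nattr)} {incr icol} {
        lappend row $result1($irow,$icol)
    }
    return $row
}
```

Keeping the NULL flags under separate indexes means ordinary reads of
(row,col) never need to check them; only code that cares about the
difference between NULL and an empty string has to look.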
The query result values are stored in result($irow,$icol) where $irow is
the tuple (row) number, between 0 and $result(ntuple)-1 inclusive, and
$icol is the attribute (column) number, between 0 and $result(nattr)-1
inclusive.

If the value returned by the database is NULL, then $result($irow,$icol)
is set to an empty string, and $result(null,$irow,$icol) is also set to an
empty string for this row and column. For non-NULL values,
$result(null,$irow,$icol) is not set at all. The "null,*,*" indexes are
used only by pg_result -getNull if it is necessary for the application to
distinguish NULL from empty string - both of which are stored as empty
strings in result($irow,$icol) and return an empty string with any of the
pg_result access methods. There is no way to distinguish NULL from empty
string with pg_select, pg_execute, or pg_exec_prepared.

The entire result of a query is stored before anything else happens (that
is, before pg_exec and pg_exec_prepared return, and before pg_execute and
pg_select process the first row). This is also true of libpq and pgtcl-ng
(in their synchronous mode), but Tcl can be slower.

Extended error message fields are new with PostgreSQL-7.4. Individual
parts of a received error message are stored in the result array indexed
by (error,$c) where $c is the one-letter code used in the protocol. See
the pgin.tcl documentation for "pg_result -errorField" for more
information. (As of 2.2.0, pg_result -errorField is the same as
pg_result -error: both take an optional field name or code argument to
return an extended error message field, rather than the full message.)

-----------------------------------------------------------------------------

BUFFERING

PostgreSQL protocol version 3 (PostgreSQL-7.4) uses a message-based
protocol. To read messages from the backend, pgin.tcl implements a
per-connection buffer using several Tcl variables in the pgtcl namespace.
The name of the connection handle (the socket name) is part of the
variable name, represented by $c below.

    pgtcl::buf_$c     The buffer holding a message from the backend.
    pgtcl::bufi_$c    Index of the next byte to be processed from buf_$c
    pgtcl::bufn_$c    Total number of bytes in the buffer buf_$c.

For example, if the connection handle is "sock3", the variables are
pgtcl::buf_sock3, pgtcl::bufi_sock3, and pgtcl::bufn_sock3.

A few tests determined that the fastest way to fetch data from the buffers
in Tcl was to use [string index] and [string range], although this might
not seem intuitive.

-----------------------------------------------------------------------------

PARAMETERS

The PostgreSQL backend can notify a front-end client about some
parameters, and pgin.tcl stores these in the following variable in the
pgtcl namespace:

    pgtcl::param_$c   Array of parameter values, indexed by parameter name

where $c is the connection handle (socket name).

Access to these parameters is through the pg_parameter_status command, a
pgin.tcl extension.

-----------------------------------------------------------------------------

PROTOCOL ISSUES

This version of pgin.tcl speaks only to a Protocol Version 3 PostgreSQL
backend (7.4 or later). There is one concession made to Version 2, and
that is reading an error message. If a Version 2 error message is read,
pgin.tcl will recognize it and pretend it got a Version 3 message. This is
for use during the connection stage, to allow it to fail with a proper
message if connecting to a Version 2-only backend.

-----------------------------------------------------------------------------

NOTIFICATIONS

An array pgtcl::notify keeps track of notifications you want. The array is
indexed as pgtcl::notify(connection,name) where connection is the
connection handle (socket name) and name is the parameter used in
pg_listen. The value of an array element is the command to execute on
notification.
This can be a procedure name, or a procedure name with leading arguments.
It must be a proper Tcl list.

Starting with PostgreSQL-9.0.0, a 'payload' string can be provided with
the SQL NOTIFY command. Starting with pgin.tcl-3.2.0, this payload (if not
empty) will be passed as an additional argument to the command. The
command is taken as a list, and the payload is appended as in lappend. The
resulting list is the command to execute. If there is no payload, or it is
empty, or the server is older than PostgreSQL-9.0.0, no additional
argument will be passed to the command. The command should therefore
always accept an optional argument.

Starting with pgintcl-3.4.0, there is an additional array
pgtcl::notifopt() to store options for the notification. This array is
indexed the same way as pgtcl::notify(), and holds integer values. The
value is 0 if there are no options for this notification. The value is 1
if the notification listener should get the notifying backend process ID
as an argument, as indicated by the -pid option to pg_listen. No other
options are defined.

-----------------------------------------------------------------------------

NOTICES

Notice and warning message handling can be customized using the
pg_notice_handler command. By default, the notice handler is
    puts -nonewline stderr
and this string will be returned the first time pg_notice_handler is
called.

A notice handler should be defined as a proc with one or more arguments.
Leading arguments are supplied when the handler is set with
pg_notice_handler, and the final argument is the notice or warning
message.

-----------------------------------------------------------------------------

LARGE OBJECTS

The large object commands are implemented using the PostgreSQL "fast-path"
function call interface (same as libpq). See the next section for more
information on fast-path.

The pg_lo_creat command takes a mode argument.
According to the PostgreSQL libpq documentation, lo_creat should take
"INV_READ", "INV_WRITE", or "INV_READ|INV_WRITE". (pgin.tcl accepts "r",
"w", and "rw" as equivalent to those respectively, but this is not
compatible with pgtcl-ng.) It isn't clear why you would ever create a
large object with a mode other than "INV_READ|INV_WRITE".

The pg_lo_open command also takes a mode argument. According to the
PostgreSQL libpq documentation, lo_open takes the same mode values as
lo_creat. But in libpgtcl the pg_lo_open command takes "r", "w", or "rw"
for the mode, for some reason. pgin.tcl accepts either form for mode, but
to be compatible with libpgtcl you should use "r", "w", or "rw" with
pg_lo_open instead of INV_READ, INV_WRITE, or INV_READ|INV_WRITE.

-----------------------------------------------------------------------------

FAST-PATH FUNCTION CALLS

Access to the PostgreSQL "Fast-path function call" interface is available
in pgin.tcl. This was written to implement the large object commands, and
general use is discouraged. See the libpq documentation for more details
on what this interface is and how to use it.

It is expected that the Fast-path function call interface in PostgreSQL
will be deprecated in favor of using the Extended Protocol to do separate
Prepare, Bind, and Execute steps. See PREPARE/BIND/EXECUTE.

Internally, backend functions are called by their PostgreSQL OID, but
pgin.tcl handles the mapping of function name to OID for you. The
fast-path function interface in pgin.tcl uses an array pgtcl::fnoids to
cache object IDs of the PostgreSQL functions. One instance of this array
is shared among all connections, under the assumption that these OIDs are
common to all databases. (It is possible that if you have simultaneous
connections to multiple database servers running different versions of
PostgreSQL this could break.)
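
The caching idea described above can be sketched roughly as follows. This
is only an illustration of a name-to-OID cache shared by all callers, not
the pgin.tcl implementation: the "demo" namespace, the procs
server_lookup and get_oid, the lookups counter, and the OID values are
all made up for the example, and the server query is replaced by a stub.

```tcl
namespace eval demo {
    variable fnoids          ;# cache: function name -> OID
    array set fnoids {}
    variable lookups 0       ;# counts simulated server round trips
}

# Hypothetical stand-in for asking the server for a function's OID.
# The OID values below are invented for illustration only.
proc demo::server_lookup {fname} {
    variable lookups
    incr lookups
    array set fake {lo_open 9001 lo_close 9002}
    if {![info exists fake($fname)]} {
        error "no such function: $fname"
    }
    return $fake($fname)
}

# Return the OID for fname, consulting the shared cache first and
# falling back to the (simulated) server lookup on a miss.
proc demo::get_oid {fname} {
    variable fnoids
    if {![info exists fnoids($fname)]} {
        set fnoids($fname) [server_lookup $fname]
    }
    return $fnoids($fname)
}
```

With a cache like this, a second call such as [demo::get_oid lo_open]
returns the stored value without another lookup, which is also why a
single shared cache could give wrong answers if two servers assigned
different OIDs to the same function name.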
The index to pgtcl::fnoids is the name of the function, or the function
plus argument type list, as supplied to the pgin.tcl fast-path function
call commands. The value of each array index is the OID of the function.

PostgreSQL supports overloaded functions (same name, different number
and/or types of arguments). You can call overloaded functions with
pgin.tcl by specifying the argument type list after the function name. See
examples below. You must specify the argument list exactly like psql "\df"
does - as a list of correct type names, separated by a single comma and
space. There is currently no provision to distinguish functions by their
return type. It doesn't seem like there are any PostgreSQL functions which
differ only by return type.

Before PostgreSQL-7.4, certain errors in fast-path calls (such as
supplying the wrong number of arguments to the backend function) would
cause the back-end and front-end to lose synchronization, and the channel
would be closed. This was true about libpq as well. This has been fixed
with the new protocol in PostgreSQL-7.4.

Commands:

  pg_callfn $db "fname" result "arginfo" arg...

     Call a PostgreSQL backend function and store the result.
     Returns the size of the result in bytes.

     Parameters:

       $db is the connection handle.

       "fname" is the PostgreSQL function name. This is either a simple
       name, like "encode", or a name followed by a parenthesized
       argument type list, like "like(text, text)". The second form is
       needed to specify which of several overloaded functions you want
       to call.

       "result" is the name of a variable where the PostgreSQL backend
       function returned value is to be stored. The number of bytes
       stored in "result" is returned as the value of pg_callfn.

       "arginfo" is a list of argument descriptors. Each list element is
       one of the following:
           I    An integer32 argument is expected.
           S    A Tcl string argument is expected. The length of the
                string is used (remember Tcl strings can contain null
                bytes).
           n (an integer > 0)
                A Tcl string argument is expected, and exactly this many
                bytes of the string argument are passed (padding with
                null bytes if needed).

       arg... Zero or more arguments to the PostgreSQL function follow.
       The number of arguments must match the number of elements in the
       "arginfo" list. The values are passed to the backend function
       according to the corresponding descriptor in "arginfo".

For PostgreSQL backend functions which return a single integer32 argument,
the following simplified interface is available:

  pg_callfn_int $db "fname" "arginfo" arg...

     The db, fname, arginfo, and other arguments are the same as for
     pg_callfn. The return value from pg_callfn_int is the integer32
     value returned by the PostgreSQL backend function.

Examples:

   Note: These examples demonstrate the command, but in all of these
   cases you would be better off using an SQL query instead.

      set n [pg_callfn $db version result ""]

   This calls the backend function version() and stores the return value
   in $result and the result length in $n.

      pg_callfn $db encode result {S S} $str base64

   This calls the backend function encode($str, "base64") with 2 string
   arguments and stores the result in $result.

      pg_callfn_int $db length(text) S "This is a test"

   This calls the backend function length("This is a test"). Because
   there are multiple functions called length(), the argument type list
   "(text)" must be given after the function name. The length of the
   string (14) is returned by the function.

-----------------------------------------------------------------------------

PREPARE/BIND/EXECUTE

Starting with PostgreSQL-7.4, access to separate Parse, Bind, and Execute
steps is provided by the protocol. The Parse step can be replaced by an
SQL PREPARE command. pgin.tcl provides support for this extended query
protocol with pg_exec_prepared (introduced in pgin.tcl-2.0.0), and
pg_exec_params (introduced in pgin.tcl-2.1.0).
There is also a variation of pg_exec which provides a simplified interface
to pg_exec_params.

The main advantage of the extended query protocol is separation of
parameters from the query text string. This avoids the need to quote and
escape parameters, and may prevent SQL injection attacks. pg_exec_prepared
also offers some performance advantages if a query can be prepared,
parsed, and stored once and then executed multiple times without
re-parsing.

In addition to working with text parameters and results, the
pg_exec_prepared and pg_exec_params commands support sending unescaped
binary data to the server. (Fast-path function calls also support this.)
These commands also support returning binary data to the client. (This can
also be done with binary cursors.)

Although the protocol definition and pgin.tcl commands support mixed text
and binary results, libpq requires all result columns to be text, or all
binary. Using mixed binary/text result columns will make your application
incompatible with libpq-based versions of this interface.

pg_exec_prepared is for execution of pre-prepared SQL statements after
binding parameters. A named SQL statement must be prepared using the SQL
"PREPARE" command before using pg_exec_prepared. An advantage of
pg_exec_prepared is that the protocol-level Parse step requires the client
to translate parameter types to OIDs, but using PREPARE lets the server
determine the parameter argument types.

pg_exec_prepared is modeled after the libpq call: PQexecPrepared().

pg_exec_params does all three steps of the extended query protocol: parse,
bind, and execute. Parameter types can be specified by type OID, or
parameters can be passed as text to be interpreted by the server as it
does for any untyped literal string. To find the type OID of a PostgreSQL
type '', you need to query the server like this:
    SELECT oid FROM pg_type where typname=''

pg_exec_params is modeled after the libpq call: PQexecParams().
A limitation of both pg_exec_prepared and pg_exec_params is lack of
support for NULLs as parameter values. There is no way to pass a NULL
parameter to the prepared statement. This is not a protocol or database
limitation, but just lack of a good idea on how to implement the command
interface to support NULLs without needlessly complicating the more common
case without NULLs.

-----------------------------------------------------------------------------

MD5 AUTHENTICATION

MD5 authentication was added at PostgreSQL-7.2. This is a
challenge/response protocol which avoids having clear-text passwords
passed over the network. To activate this, the PostgreSQL administrator
puts "md5" in the pg_hba.conf file instead of "password". Pgin.tcl
supports this transparently; that is, if the backend requests MD5
authentication during the connection, pg_connect will use this protocol.

The MD5 implementation was coded by the original author of pgin.tcl. It
does not use the tcllib implementation, which is significantly faster but
much more complex.

-----------------------------------------------------------------------------

ENCODING

Character set encoding was added to pgin.tcl-3.0.0. More information can
be found in the README and REFERENCE files.
The following are converted to Unicode before being sent to PostgreSQL:
  + Query strings (pg_exec, and all higher-level commands which use it)
  + TEXT-format query parameters in pg_exec_prepared/pg_exec_params
  + All parameter arguments in pg_exec when query parameters are used
  + Prepared statement name in pg_exec_prepared
  + COPY table FROM STDIN data sent using pg_copy_write

The following are converted from Unicode when received from PostgreSQL:
  + Query result column data when TEXT-format (not when BINARY-format)
  + All Error and Notice response strings
  + Parameter names and values
  + Notification messages
  + Command completion message
  + Query result field names (column names)
  + COPY table TO STDOUT data received using pg_copy_read

Conversion of data to Unicode for sending to PostgreSQL occurs in 5 places
in the code: pg_exec and pg_exec_params query strings, pg_exec_prepared
statement name, pg_exec_prepared text format parameters, and when writing
COPY FROM data in pg_copy_write.

Conversion of Unicode data from PostgreSQL occurs in 3 places in the code:
when receiving a protocol message "string" type (which covers various
messages, parameters, and field names), when reading TEXT mode tuple data,
and when reading COPY TO data in pg_copy_read.

There is no Unicode conversion for the connection parameters username,
database-name, or password. PostgreSQL seems to store these using the
encoding of the database cluster/template1 database, which may differ from
the encoding of the database to which the client is connected. It is
unclear how to recode these characters. At this time, it is wise to avoid
non-ASCII characters in database names, usernames, and passwords. This may
be fixed in the future.

The fast-path function call interface treats all its arguments as binary
data and does not encode or decode them.
The fast-path function calls were implemented primarily for large object
support, and large object support is not affected by Unicode encoding
because it is all binary data. It is unlikely that encoding support will
be added to fast-path function calls, since parameterized queries are the
preferred replacement.

-----------------------------------------------------------------------------

ESCAPING

An array pgtcl::std_str() is used to store the per-connection setting for
the PostgreSQL setting standard_conforming_strings. This was added in
pgin.tcl-3.1.0 to support the versions of pg_escape_string, pg_quote, and
pg_escape_bytea which accept an optional $conn argument.

If the array value indexed by $conn is 1, then standard conforming strings
is on for that database connection, and the backslash (\) is not
considered special in SQL quoted string constants. In this case,
pg_escape_string and pg_quote will not double backslashes. pg_escape_bytea
will omit one level of backslashes when escaping backslash and octal
values.

If the array value indexed by $conn is 0, then standard conforming strings
is off for that database connection, and the backslash (\) is special in
SQL quoted string constants. In that case, pg_escape_string and pg_quote
will double backslashes. pg_escape_bytea will use 4 backslashes for a
single backslash, and 2 backslashes in an octal value.

There is also an array index "_default_" which is used when no $conn
argument is supplied to the escape commands. Just as in libpq, the
_default_ value is set any time a Set Parameter message for
standard_conforming_strings is received over any open database connection.
If you are using a single connection, or multiple connections with the
same value for standard_conforming_strings, you will get correct escaping
behavior even without using the $conn argument when escaping strings.

-----------------------------------------------------------------------------
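
The effect of the per-connection setting on string escaping can be
sketched as below. This is a simplified illustration and not the
pg_escape_string code: the "demo" namespace and the escape_string proc
are hypothetical, the handle "sock3" is just an example name, and the
doubling of single quotes is an assumption based on ordinary SQL string
literal rules rather than something stated in these notes.

```tcl
namespace eval demo {
    variable std_str
    # _default_ is used when no connection is given; 0 means
    # standard_conforming_strings is off.
    array set std_str {_default_ 0}
}

# Hypothetical escape helper. Single quotes are always doubled
# (assumed SQL literal rule); backslashes are doubled only when the
# setting is off (0) for the chosen connection.
proc demo::escape_string {s {conn _default_}} {
    variable std_str
    set map [list ' '']
    if {!$std_str($conn)} {
        lappend map \\ \\\\
    }
    return [string map $map $s]
}

# Example: one connection with the setting on, the default with it off.
set demo::std_str(sock3) 1
set with_conn    [demo::escape_string {it's a\b} sock3]
set with_default [demo::escape_string {it's a\b}]
```

In this sketch $with_conn keeps the single backslash while $with_default
doubles it, which mirrors why passing the connection handle matters when
connections differ in their standard_conforming_strings setting.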