vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Heikki, I see some strange changes in your patch, not related to tsearch at all: contrib/pageinspect/pageinspect.sql.in contrib/pageinspect/rawpage.c > The usage of the QueryItem struct was very confusing. It was used for > both operators and operands. For operators, "val" was a single character > casted to a int4, marking the operator type. For operands, val was the > CRC-32 of the value. Other fields were used only either for operands or > for operators. The biggest change in the patch is that I broke the > QueryItem struct into QueryOperator and QueryOperand. Type was really .... > - Removed ParseQueryNode struct used internally by makepol and friends. > push*-functions now construct QueryItems directly. It's needed to set unused bytes in QueryItem to zero, it's common requiremens for types in pgsql. After allocating space for tsquery in parse_tsquery you copy just sizeof(QueryOperator) bytes and leave sizeof(QueryItem) - sizeof(QueryOperator) bytes untouched. QueryOperand is a biggest component in QueryItem union. I don't check other places. > that? And parse_query always produces trees that are in prefix notation, > so the left-field is really redundant, but using tsqueryrecv, you could > inject queries that are not in prefix notation; is there anything in the > code that depends on that? It's used by TS_execute for optimization reason. With clear postfix notation you should go through every nodes. For example: FALSE FALSE & FALSE & You will go to the end of query to produce correct result. In fact, TSQuery is a prefix notation with pointer to another operand or, by another words, just a plain view of tree where right operand of operation is always placed after operation. That notation allows to calculate only one of operand if it possible: & FALSE & FALSE FALSE 1 2 3 4 5 --Nodes After evaluating of second node you can return FALSE for whole expression and do not evaluate nodes 3-5. For query & TRUE & FALSE & FALSE it's needed to evaluate 1,2,3,4 nodes. In most cases checking QI_VAL node is much more expensive that QI_OPR > > - There's many internal intermediate representations of a query: > TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix > notation stack of QueryItems (in parser), infix-tree. Could we remove > some of these? I havn't strong objections, QTNode and NODE are tree-like structures, but TSQuery is a postfix notation for storage in plain memory. NODE is used only cleanup stop-word placeholders, so it's a binary tree while QTNode represents t-ary tree (with any number of children). Thank you for your interesting in tsearch - after recheck of problem pointed above I'll commit your patch. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Teodor Sigaev wrote: > Heikki, I see some strange changes in your patch, not related to tsearch > at all: > contrib/pageinspect/pageinspect.sql.in > contrib/pageinspect/rawpage.c Oh, sorry about that. Just ignore them, they are some quick&dirty debugging functions I had in the same source tree. >> The usage of the QueryItem struct was very confusing. It was used for >> both operators and operands. For operators, "val" was a single character >> casted to a int4, marking the operator type. For operands, val was the >> CRC-32 of the value. Other fields were used only either for operands or >> for operators. The biggest change in the patch is that I broke the >> QueryItem struct into QueryOperator and QueryOperand. Type was really > ... >> - Removed ParseQueryNode struct used internally by makepol and friends. >> push*-functions now construct QueryItems directly. > > It's needed to set unused bytes in QueryItem to zero, it's common > requiremens for types in pgsql. After allocating space for tsquery in > parse_tsquery you copy just sizeof(QueryOperator) bytes and leave > sizeof(QueryItem) - sizeof(QueryOperator) bytes untouched. QueryOperand > is a biggest component in QueryItem union. I don't check other places. Ok. Probably easiest to do that by changing the palloc to palloc0 in parse_tsquery. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| > Ok. Probably easiest to do that by changing the palloc to palloc0 in > parse_tsquery. and change sizeof to sizeof(QueryItem) -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Teodor Sigaev wrote: >> Ok. Probably easiest to do that by changing the palloc to palloc0 in >> parse_tsquery. > > and change sizeof to sizeof(QueryItem) Do you mean the sizeofs in the memcpys in parse_tsquery? You can't change them. The source structs are allocated in pushOperand/pushOperator, using sizeof(QueryOperand/QueryOperator), so if you copy sizeof(QueryItem) bytes from them, you'll copy some random piece of memory. If you just allocate the TSQuery with palloc0, that'll make sure that any memory area not copied into in the loop will be zero. BTW, can you explain what the CRC-32 of a value is used for? It looks like it's used to speed up some operations, by comparing the CRCs before comparing the values, but I didn't quite figure out how it works. How much of a performance difference does it make? Would hash_any do a better/cheaper job? In any case, I think we need to calculate the CRC/hash in tsqueryrecv, instead of trusting the client. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Heikki Linnakangas wrote: > Teodor Sigaev wrote: >>> Ok. Probably easiest to do that by changing the palloc to palloc0 in >>> parse_tsquery. >> and change sizeof to sizeof(QueryItem) > > Do you mean the sizeofs in the memcpys in parse_tsquery? You can't Oops, I meant pallocs in push* function. palloc0 in parse_tsquery is another way. > > BTW, can you explain what the CRC-32 of a value is used for? It looks > like it's used to speed up some operations, by comparing the CRCs before > comparing the values, but I didn't quite figure out how it works. How It's mostly used in GiST indexes - recalculating crc32 every time for each index tuple to be checked is rather expensive. > much of a performance difference does it make? Would hash_any do a > better/cheaper job? crc32 was chosen after testing a lot of hash function. Perl's hash was the fastest, but crc32 makes much less number of collisions. That's interesting for ASCII a lot of functions produce rather small number of collision, but for upper part of table (0x7f-0xff) crc32 was the best. CRC32 has evenly distributed collisions over characters, others - not. > In any case, I think we need to calculate the CRC/hash in tsqueryrecv, > instead of trusting the client. Agreed. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Teodor Sigaev wrote: > Heikki Linnakangas wrote: >> In any case, I think we need to calculate the CRC/hash in tsqueryrecv, >> instead of trusting the client. > Agreed. I started to write a patch for that, as I realized that we're transferring the strings in a tsvector/tsquery in server encoding. That's not good, right? A dump created with COPY ... BINARY wouldn't be portable across clusters with different encodings, for example. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| "Heikki Linnakangas" <heikki@enterprisedb.com> writes: > I started to write a patch for that, as I realized that we're > transferring the strings in a tsvector/tsquery in server encoding. > That's not good, right? A dump created with COPY ... BINARY wouldn't be > portable across clusters with different encodings, for example. Any portion of a binary value that is considered textual should be converted to and from client encoding --- cf textsend/textrecv. This should be pretty trivial to fix, just call a different support routine. BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes, or do you want someone else to do it? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Tom Lane wrote: > Any portion of a binary value that is considered textual should be > converted to and from client encoding --- cf textsend/textrecv. > This should be pretty trivial to fix, just call a different support > routine. You do need to adjust length and position fields in the structs as well. I fixed (rewrote, almost) the send/recv functions, and added a comment above them describing the on-wire format. The CRC is now recalculated in tsquery as well per previous discussion. Patch attached. This is on top of the previous patches I sent. It includes some additional changes that I had already started with. Most notably: - change the alignment requirement of lexemes in TSVector slightly. Lexeme strings were always padded to 2-byte aligned length to make sure that if there's position array (uint16[]) it has the right alignment. The patch changes that so that the padding is not done when there's no positions. That makes the storage of tsvectors without positions slightly more compact. - added some #include "miscadmin.h" lines I missed in the earlier when I added calls to check_stack_depth(). BTW, the encoding of the XML datatype looks pretty funky. xml_recv first reads the xml string with pq_getmsgtext, which applies a client->server conversion. Then the xml declaration is parsed, extracting the encoding attribute. Then the string is converted again from that encoding (or UTF-8 if none was specified) to server encoding. I don't understand how it's supposed to work, but ISTM there's one conversion too much, > BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes, > or do you want someone else to do it? I am getting confused with the patches and version I have lying around here... I think I'll have to wait for review of the patches I've posted this far before I continue hacking. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Heikki Linnakangas wrote: > BTW, the encoding of the XML datatype looks pretty funky. xml_recv first > reads the xml string with pq_getmsgtext, which applies a client->server > conversion. Then the xml declaration is parsed, extracting the encoding > attribute. Then the string is converted again from that encoding (or > UTF-8 if none was specified) to server encoding. I don't understand how > it's supposed to work, but ISTM there's one conversion too much, And it's got an unfortunate typo in it as well: it calls "free(result)" instead of pfree. I think we need regression tests for the more complex send/recv functions... What's the difference between text and binary mode for something like xml anyway? Could we just call the text format in/out functions and be done with it? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| ||||
| > BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes, > or do you want someone else to do it? I'll do it. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| Thread Tools | |
| Display Modes | |
|
|