vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I got a chance to review this patch over the weekend. Basic API seems good, met all my requirements, no surprises with how the GUC variable controlled the feature. The most fundamental issue I have with the interface is that using COPY makes it difficult to put any unique index on the resulting table. I like to have a unique index on my imported log table because it rejects the dupe records if you accidentally import the same section of log file twice. COPY tosses the whole thing if there's an index violation, which is a problem during a regular import because you will occasionally come across lines with the same timestamp that are similar in every way except for their statment; putting an index on the timestamp+statement seems impractical. I've had a preference for INSERT from the beginning here that this reinforces. I'm planning to just work around this issue by doing the COPY into a temporary table and then INSERTing from there. I didn't want to just let the concern pass by without mentioning it though. It crosses my mind that inserting some sort of unique log file line ID number would prevent the dupe issue and make for better ordering (it's possible to have two lines with the same timestamp show up in the wrong order now), not sure that's a practical idea to consider. The basic coding of the patch seemed OK to me, but someone who is much more familiar than myself with the mechanics of pipes should take a look at that part of the patch before committing; it's complicated code and I can't comment on it. There are some small formatting issues that need to be fixed, particularly in the host+port mapping. I can fix those myself and submit a slightly updated patch. There's some documentation improvements I want to make before this goes in as well. The patch is actually broken fairly hard right now because of the switch from INSERT to COPY FROM CSV as the output format at the last minute. It outputs missing fields as NULL (fine for INSERT) that chokes the CSV import when the session_start timestamp is missing. All of those NULL values need to be just replaced with nothing for proper CSV syntax; there should just the comma for the next field. I worked around this with copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL'; I can fix that too when I'm revising. I plan to have a version free of obvious bugs to re-submit ready by next weekend. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Greg Smith wrote: > I got a chance to review this patch over the weekend. Basic API seems > good, met all my requirements, no surprises with how the GUC variable > controlled the feature. > > The most fundamental issue I have with the interface is that using > COPY makes it difficult to put any unique index on the resulting > table. I like to have a unique index on my imported log table because > it rejects the dupe records if you accidentally import the same > section of log file twice. COPY tosses the whole thing if there's an > index violation, which is a problem during a regular import because > you will occasionally come across lines with the same timestamp that > are similar in every way except for their statment; putting an index > on the timestamp+statement seems impractical. Does the format not include the per-process line number? (I know i briefly looked at this patch previously, but I forget the details.) One reason I originally included line numbers in log_line-prefix was to handle this sort of problem. > > I've had a preference for INSERT from the beginning here that this > reinforces. COPY is our standard bulk insert mechanism. I think arguing against it would be a very hard sell. > I'm planning to just work around this issue by doing the COPY into a > temporary table and then INSERTing from there. I didn't want to just > let the concern pass by without mentioning it though. It crosses my > mind that inserting some sort of unique log file line ID number would > prevent the dupe issue and make for better ordering (it's possible to > have two lines with the same timestamp show up in the wrong order > now), not sure that's a practical idea to consider. I guess that answers my question. We should definitely provide a unique line key. > > The basic coding of the patch seemed OK to me, but someone who is much > more familiar than myself with the mechanics of pipes should take a > look at that part of the patch before committing; it's complicated > code and I can't comment on it. There are some small formatting > issues that need to be fixed, particularly in the host+port mapping. > I can fix those myself and submit a slightly updated patch. There's > some documentation improvements I want to make before this goes in as > well. > > The patch is actually broken fairly hard right now because of the > switch from INSERT to COPY FROM CSV as the output format at the last > minute. It outputs missing fields as NULL (fine for INSERT) that > chokes the CSV import when the session_start timestamp is missing. > All of those NULL values need to be just replaced with nothing for > proper CSV syntax; there should just the comma for the next field. I > worked around this with > > copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL'; > > I missed that before - yes it should be fixed. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Greg Smith <gsmith@gregsmith.com> writes: > The most fundamental issue I have with the interface is that using COPY > makes it difficult to put any unique index on the resulting table. I like > to have a unique index on my imported log table because it rejects the > dupe records if you accidentally import the same section of log file > twice. COPY tosses the whole thing if there's an index violation, which > is a problem during a regular import because you will occasionally come > across lines with the same timestamp that are similar in every way except > for their statment; putting an index on the timestamp+statement seems > impractical. Essentially the above is arguing that you want a unique index but you can't be bothered to invent an actually-unique key. This doesn't seem a sound argument to me. If we need a unique key, let's find one. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Sun, 20 May 2007, Andrew Dunstan wrote: > Does the format not include the per-process line number? It does not, and I never noticed that under the prefix possibilities---never seemed import before! The combination of timestamp/pid/line (%t %p %l) looks like a useful and unique key here, so I'll add another column for the line number to the output. Thanks for pointing that out, I can finish cleaning up all the functional implementation work on this patch myself now. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Sun, 20 May 2007, Andrew Dunstan wrote: >> I've had a preference for INSERT from the beginning here that this >> reinforces. > COPY is our standard bulk insert mechanism. I think arguing against it would > be a very hard sell. Let me say my final peace on this subject...if I considered this data to be strictly bulk insert, then I'd completely agree here. Most of the really interesting applications I was planning to build on top of this mechanism are more interactive than that though. Here's a sample: -Write a daemon that lives on the server, connects to a logging database, and pops into an idle loop based on LISTEN. -A client app wants to see the recent logs files. It uses NOTIFY to ask the daemon for them and LISTENs for a response. -The daemon wakes up, reads all the log files since it last did something, and appends those records to the log file table. It sends out a NOTIFY to say the log file table is current. That enables remote clients to grab the log files from the server whenever they please, so they can actually monitor themselves. Benchmarking is the initial app I expect to call this, and with some types of tests I expect the daemon to be importing every 10 minutes or so. Assuming a unique index on the data to prevent duplication is a required feature, I can build this using the COPY format logs as well, but that requires I either a) am 100% perfect in making sure I never pass over the same data twice, which is particularly annoying when the daemon gets restarted, or b) break the COPY into single lines and insert them one at a time, at which point I'm not bulk loading at all. If these were INSERT statements instead, I'd have a lot more tolerance for error, because the worst problem I'd ever run into is spewing some unique key violation errors into the logs if I accidentally imported too much. With COPY, any mistake or synchronization issue and I lose the whole import. I don't mean to try and stir this back up again as an argument (particularly not on this list). There are plenty of other apps where COPY is clearly the best approach, you can easily make a case that my app is a fringe application rather than a mainstream one, and on the balance this job is still far easier than my current approach of parsing the logs. I just wanted to give a sample of how using COPY impacts the dynamics of how downstream applications will have to work with this data, so you can see that my contrary preference isn't completely random here. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Greg Smith wrote: > The attached patch fixes all the issues I found in the original > version of this code and completes the review I wanted to do. Someone > else will need to take this from here. As I already mentioned, I > can't comment on the quality of the piping implementation used to add > this feature other than to say it worked for me. I'll take it from here. > > > -Added a new documentation section to the logging chapter devoted just > to the csvlog feature. It gives a sample table and import syntax. I > also gave recommendations on how to configure some related log file > parameters that can interact badly with this feature. For example, I > noticed that if log_rotation_size was set to a value, it could split > the CSV lines in two; the result was two CVS files you couldn't import > because of the partial lines in each. Since the rotation size feature > causes other issues anyway that make importing more complicated, > documenting the issue seemed sufficient. > > What are the other issues? I'm not happy about producing files with split lines. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Fri, 1 Jun 2007, Andrew Dunstan wrote: > Greg Smith wrote: >> Since the rotation size feature causes other issues anyway >> that make importing more complicated, documenting the issue seemed >> sufficient. > > What are the other issues? I'm not happy about producing files with split > lines. Just that it's fairly simple and predictable to know what your log files are going to be named and when they'll rollover if you use something like a date-based naming convention--but when you add size-based rotation into the mix figuring out what files you need to import and when you should import them gets more complicated. Clearly fixing this issue altogether would be better, and I gather the problem may extend to any time there is a switch to a new log file; my "workaround" doesn't appear good enough anyway. I'm very glad I caught and mentioned this now. Because of the extra naming/import complexity, it still might be worthwhile to suggest people not combine size-based rotation and the csvlog though. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| ||||
| Greg Smith wrote: > The attached patch fixes all the issues I found in the original > version of this code and completes the review I wanted to do. Someone > else will need to take this from here. As I already mentioned, I > can't comment on the quality of the piping implementation used to add > this feature other than to say it worked for me. The code below strikes me as being seriously broken. The comment says that it doubles all single quotes, double quotes and backslashes. It doesn't (backslashes are not doubled) and in any case doing that would simply be wrong (with or without backslashes). The ONLY things that should be doubled are double quotes. Period. That part is easy to fix. But here's my question: why are we worrying at all about things like the encoding? Especially, why are we worrying about the *client* encoding for a server log file? We surely aren't going to switch encodings in the middle of a log file! ISTM that this routine should simply copy the input, byte for byte, apart from doubling the dquotes. Does that make sense? I assume that this routine has been copied from somewhere else without sufficient regard to the context (or the logic). The code also illustrates something else that annoyed me elsewhere in the patch and that I have eliminated, namely use of a macro placed in the header file and then used in exactly one place in the code. The macro didn't make anything clearer - quite the reverse. ISTM that a macro used only in one file should be defined in that file, generally, and if it's used only once is probably not much use anyway. cheers andrew > + > + /* > + * Escapes special characters in the string to conform > + * with the csv type output. > + * Replaces " with "", ' with '' and \ with \\. > + */ > + static size_t > + escape_string_literal(char *to, const char *from) > + { > + const char *source = from; > + char *target = to; > + size_t remaining = 0; > + int client_encoding = 0; > + > + if (from == NULL) > + return remaining; > + > + remaining = strlen(from); > + client_encoding = pg_get_client_encoding(); > + > + while (remaining > 0 && *source != '\0') > + { > + char c = *source; > + int len; > + int i; > + > + /* Fast path for plain ASCII */ > + if (!IS_HIGHBIT_SET(c)) > + { > + /* Apply quoting if needed */ > + if (CSV_STR_DOUBLE(c, false)) > + *target++ = c; > + /* Copy the character */ > + *target++ = c; > + source++; > + remaining--; > + continue; > + } > + > + /* Slow path for possible multibyte characters */ > + len = pg_encoding_mblen(client_encoding, source); > + > + /* Copy the character */ > + for (i = 0; i < len; i++) > + { > + if (remaining == 0 || *source == '\0') > + break; > + *target++ = *source++; > + remaining--; > + } > + > + /* > + * If we hit premature end of string (ie, incomplete multibyte > + * character), try to pad out to the correct length with spaces. > + * We may not be able to pad completely, but we will always be > + * able to insert at least one pad space (since we'd not have > + * quoted a multibyte character). This should be enough to make > + * a string that the server will error out on. > + */ > + if (i < len) > + { > + for (; i < len; i++) > + { > + if (((size_t) (target - to)) / 2 >= strlen(from)) > + break; > + *target++ = ' '; > + } > + break; > + } > + } > + > + /* Write the terminating NUL character. */ > + *target = '\0'; > + > + return target - to; > + } > > ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |