This is a discussion on Advice sought on best way to get data into a Linux db2 8.1.4 database within the DB2 forums, part of the Database Server Software category; --> Hi all, I'm looking for some advice on how best to implement storage of access logs into a db/2 ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi all, I'm looking for some advice on how best to implement storage of access logs into a db/2 8.1.4 database running on a RH 7.2 system. I have 5 (squid) web caches running here that service the whole university. All access to external web sites must go through these caches. Each cache generates a gzip'd access log file that's about 100Mbytes every night. At the moment I'm ftp'ing them to a central system each night for processing which generates a set of html stat files that are about 1.2Gbytes - that's per night. Needless to say at that rate the 36 Gbyte disk space I've assigned to it doesn't last very long.I'm therefore looking for a way of transferring the data into a back end database that I can access via a web interface that makes use of stored procedures,java beans and jsp pages. I probably don't have to dump the data into the db in real time so I could just post process the existing access log files every night. Having said that, updating the database in real time would save a lot of disk space. I can create a named pipe on the linux box that the squid caching process writes to and have the other end connected to a process that munges the data round and writes it into a database. The code I've got ( not mine) is written in perl and writes data to a mysql database and it would (i think) be a trivial task to write the data into a db/2 back end instead The other option is to cat an existing log through the same prog and jsut update the info off line. All the web caches are running RH 8.0 with 4 Gbytes of RAM 4 with 2 * Gbit/s network links in a trunked configuration so the DB/2 server would be receiving input from 5 caches simultaneously I suppose what i'm asking is what would be the quickest way of getting the data into the database. Stick with perl/DBI ?, java prog to process the input (doesn't feel as if this would be the quickest way of doing things) piping a data file through a db2 cli interface? or something else? Any help suggestions appreciated alex |
| |||
| It is pretty quick to use the Perl/DBI to load data into db2. I loaded about 5GB data (CSV format) into a backend Linux DB2 in just 10 hrs (the hardware is slightly slower than yours). The only disadvantage is the transction control. You have to add this to MySQL Perl code ( MySQL doesn't support transaction by default). An alternative is to use DB2 CLI to do batch loading. The LOAD facility is very quick, in my case it took only 3 hrs to finish the loading and plus integrity check (depends on your table definition and the size of your data). But if you're not confident with the raw data, you'd better use IMPORT, which inserts data row by row and it took me more than 12 hrs to complete the task. May this help. Bing Alex wrote: > Hi all, > > I'm looking for some advice on how best to implement storage of access > logs into a db/2 8.1.4 database running on a RH 7.2 system. > > I have 5 (squid) web caches running here that service the whole > university. All access to external web sites must go through these > caches. Each cache generates a gzip'd access log file that's about > 100Mbytes every night. > > At the moment I'm ftp'ing them to a central system each night for > processing which generates a set of html stat files that are about > 1.2Gbytes - that's per night. > Needless to say at that rate the 36 Gbyte disk space I've assigned to > it doesn't last very long.I'm therefore looking for a way of > transferring the data into a back end database that I can access via a > web interface that makes use of stored procedures,java beans and jsp > pages. I probably don't have to dump the data into the db in real time > so I could just post process the existing access log files every > night. Having said that, updating the database in real time would save > a lot of disk space. > > I can create a named pipe on the linux box that the squid caching > process writes to and have the other end connected to a process that > munges the data round and writes it into a database. The code I've got > ( not mine) is written in perl and writes data to a mysql database and > it would (i think) be a trivial task to write the data into a db/2 > back end instead > > The other option is to cat an existing log through the same prog and > jsut update > the info off line. > > All the web caches are running RH 8.0 with 4 Gbytes of RAM 4 with 2 * > Gbit/s network links in a trunked configuration so the DB/2 server > would be receiving input from 5 caches simultaneously > > I suppose what i'm asking is what would be the quickest way of getting > the data into the database. Stick with perl/DBI ?, java prog to > process the input (doesn't feel as if this would be the quickest way > of doing things) piping a data file through a db2 cli interface? or > something else? > > Any help suggestions appreciated > > alex |
| |||
| off the topic - make sure that you siut on a supported LInux distro. We validated RH7.2 but RH took it off support (and the same is true for RH8, RH9). If you are not on a RHEL versuion they are not willing to give you any support and we can not help from the DB2 side fixing their kernel issues (even if we would ilke to sometimes). Boris "Alex" <A.Sharaz@hull.ac.uk> wrote in message news:b270c055.0312050946.59a179c0@posting.google.c om... > Hi all, > > I'm looking for some advice on how best to implement storage of access > logs into a db/2 8.1.4 database running on a RH 7.2 system. > > I have 5 (squid) web caches running here that service the whole > university. All access to external web sites must go through these > caches. Each cache generates a gzip'd access log file that's about > 100Mbytes every night. > > At the moment I'm ftp'ing them to a central system each night for > processing which generates a set of html stat files that are about > 1.2Gbytes - that's per night. > Needless to say at that rate the 36 Gbyte disk space I've assigned to > it doesn't last very long.I'm therefore looking for a way of > transferring the data into a back end database that I can access via a > web interface that makes use of stored procedures,java beans and jsp > pages. I probably don't have to dump the data into the db in real time > so I could just post process the existing access log files every > night. Having said that, updating the database in real time would save > a lot of disk space. > > I can create a named pipe on the linux box that the squid caching > process writes to and have the other end connected to a process that > munges the data round and writes it into a database. The code I've got > ( not mine) is written in perl and writes data to a mysql database and > it would (i think) be a trivial task to write the data into a db/2 > back end instead > > The other option is to cat an existing log through the same prog and > jsut update > the info off line. > > All the web caches are running RH 8.0 with 4 Gbytes of RAM 4 with 2 * > Gbit/s network links in a trunked configuration so the DB/2 server > would be receiving input from 5 caches simultaneously > > I suppose what i'm asking is what would be the quickest way of getting > the data into the database. Stick with perl/DBI ?, java prog to > process the input (doesn't feel as if this would be the quickest way > of doing things) piping a data file through a db2 cli interface? or > something else? > > Any help suggestions appreciated > > alex |
| ||||
| A.Sharaz@hull.ac.uk (Alex) wrote in message news:<b270c055.0312050946.59a179c0@posting.google. com>... > I suppose what i'm asking is what would be the quickest way of getting > the data into the database. Stick with perl/DBI ?, java prog to > process the input (doesn't feel as if this would be the quickest way > of doing things) piping a data file through a db2 cli interface? or > something else? I'd recommend a typical warehouse etl solution. However, rather than go the route of commercial etl tools (especially for such a small project), I'd implement with simple commodity components: - sftp the gzipped files to the warehouse server - transform the files from extract to load images using an easily-maintained language such as python (or whatever works best for you) - use the db2 load utility for loading into base fact tables - archive both extract and load files in a gzipped format in case you want to recover or change your data model. If you go this route then there are only two challenges initially: - learning the ins & outs of the db2 load utility - process management The load utility isn't that bad - just need to get familiar with how it locks the table, load recovery, etc. And it's very fast - a small box using load can easily hit 20,000 rows / sec - far faster than anything you could do with java, perl, etc. Process management is trickier. But if you only need to do loads once a day it isn't too tough. I normally just keep the extract, transport, transform, and load processes completely separate - each run once a minute via cron, each ensuring that only a single copy is running, and each interfacing to the other processes only via a flat file. Need to ensure that files aren't picked up until they're complete - so a signally file, or a file rename should be performed at the conclusion of a successful process. However, an initial implemention (given just 5 files a day) could be far simplier - based on a static schedule, and an alerting process in the event that those files aren't available when the downstream processes get kicked off. Developed in this fashion, a simple log fact table solution could be developed in 1-3 days, and then easily enhanced later if you wanted. ken |