Unix Technical Forum

#1
04-09-2008, 11:04 PM
Andreas Strasser
Design Question (Time Series Data)

Hello,

I'm currently designing an application that will retrieve economic data
(mainly time series) from different sources and distribute it to clients.
It is supposed to manage around 20,000 different series with differing
numbers of observations (some have only a few dozen observations, others
several thousand), and I'm now faced with the decision of where and how
to store the data.

So far, I've come up with three possible solutions:

1) Storing the observations in one big table with fields for the series,
position within the series and the value (float)
2) Storing the observations in an array (either in the same table as the
series or in an extra data table)
3) Storing the observations in CSV files on the hard disk and only
putting a reference to them in the database

I expect that around 50 series will be updated daily, which would mean
that for solution 1 around 50,000 rows would be deleted and appended
(again) every day.

I personally prefer solution 1, because it is the easiest to implement
(I need to make different calculations and be able to transform the data
easily), but I'm concerned about performance and overhead. It effectively
triples the space needed (compared with solution 2) and will result in
huge index files.

Are there any other storage methods which are better suited to this
kind of data? How can I avoid trouble resulting from the daily updates
(high number of deleted rows)? Which method would you prefer?

Thanks in advance!

Andreas

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

#2
04-09-2008, 11:04 PM
Pavel Stehule
Re: Design Question (Time Series Data)

2007/10/4, Andreas Strasser <kontakt@andreas-strasser.com>:
> Hello,
>
> I'm currently designing an application that will retrieve economic data
> (mainly time series) from different sources and distribute it to clients.
> It is supposed to manage around 20,000 different series with differing
> numbers of observations (some have only a few dozen observations, others
> several thousand), and I'm now faced with the decision of where and how
> to store the data.
>
> So far, I've come up with three possible solutions:
>
> 1) Storing the observations in one big table with fields for the series,
> position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
> series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
> putting a reference to them in the database
>


I have had good experience with variant 2. PostgreSQL needs 24 bytes for
the header of every row, so storing one field per row is not very
efficient. You can now easily transform between arrays and tables.

Pavel
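
To make the per-row overhead argument concrete, here is a rough
back-of-the-envelope estimate in Python. Only the 24-byte row header comes
from the post above; the 4-byte line pointer, the column widths and the
24-byte array header are assumptions made for illustration, and indexes,
page fill and compression of large values are ignored.

```python
# Back-of-the-envelope storage estimate. All sizes are assumptions except
# the 24-byte row header quoted in the post above.
ROW_HEADER = 24     # bytes per row header (figure from the post)
ITEM_POINTER = 4    # assumed per-row line pointer in the page
SERIES_ID = 4       # assumed int4 reference to the series
POSITION = 4        # assumed int4 position within the series
VALUE = 8           # float8 observation
ARRAY_HEADER = 24   # assumed fixed overhead of a one-dimensional array


def table_bytes(n_obs: int) -> int:
    """One row per observation (solution 1), without indexes."""
    return n_obs * (ROW_HEADER + ITEM_POINTER + SERIES_ID + POSITION + VALUE)


def array_bytes(n_obs: int) -> int:
    """One row per series holding an array of float8 values (solution 2)."""
    return ROW_HEADER + ITEM_POINTER + SERIES_ID + ARRAY_HEADER + n_obs * VALUE


for n in (36, 1000, 5000):
    ratio = table_bytes(n) / array_bytes(n)
    print(f"{n:>5} obs: table={table_bytes(n):>7} B  "
          f"array={array_bytes(n):>6} B  ratio={ratio:.1f}")
```

Under these assumptions the fixed cost per row dominates the one-row-per-
observation layout, which is the point being made above; the exact ratio
depends on the real column types and on index size.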



#3
04-09-2008, 11:04 PM
Jorge Godoy
Re: Design Question (Time Series Data)

On Thursday 04 October 2007 06:20:19 Pavel Stehule wrote:
>
> I have had good experience with variant 2. PostgreSQL needs 24 bytes for
> the header of every row, so storing one field per row is not very
> efficient. You can now easily transform between arrays and tables.


But then you'll make all SQL operations complex, and you will have problems
using aggregates, etc. For example, with a normalized design one can query
the average value of a specific series using simple commands, and given the
use of indices this could be highly optimized. Now, using an array, he'll be
doing a seqscan on every row because he needs to find out whether there was
a value for the given series, then he'll need to extract those values and
finally calculate the average (I know you can select an element of the
array, but it won't be easy on the planner or the loop to calculate the
average because they'll need to do all that on every row).

I'd use the same solution that he was going to use: a normalized table
including a timestamp (with TZ because of daylight saving time...), a column
with a FK to a series table and the value itself. Index the first two
columns (if you're searching using the value as a parameter, then index it
as well) and this would be the basis of my design for this specific
condition.

Having good statistics and tuning autovacuum will also help a lot in
handling new inserts and deletes.
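
A minimal sketch of the normalized layout described above, written in
Python with an in-memory SQLite database standing in for PostgreSQL so
that it runs on its own. The table and column names (series, observation,
observed_at) and the sample values are hypothetical; in PostgreSQL the
timestamp column would be a timestamp with time zone rather than text.

```python
import sqlite3

# In-memory SQLite is only a stand-in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE series (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE observation (
        series_id   INTEGER NOT NULL REFERENCES series(id),
        observed_at TEXT    NOT NULL,  -- ISO 8601 text in this sketch
        value       REAL    NOT NULL,
        PRIMARY KEY (series_id, observed_at)  -- indexes the first two columns
    );
""")
conn.execute("INSERT INTO series (id, name) VALUES (1, 'gdp'), (2, 'imports')")
conn.executemany(
    "INSERT INTO observation (series_id, observed_at, value) VALUES (?, ?, ?)",
    [
        (1, "2007-01-01T00:00:00+00:00", 100.0),  # made-up values
        (1, "2007-04-01T00:00:00+00:00", 102.0),
        (1, "2007-07-01T00:00:00+00:00", 104.0),
        (2, "2007-01-01T00:00:00+00:00", 50.0),
    ],
)


def series_average(conn, name):
    """Average of one series; returns None if the series has no rows."""
    row = conn.execute(
        """
        SELECT AVG(o.value)
        FROM observation AS o
        JOIN series AS s ON s.id = o.series_id
        WHERE s.name = ?
        """,
        (name,),
    ).fetchone()
    return row[0]


print(series_average(conn, "gdp"))
```

The composite primary key doubles as the index on the series reference and
the timestamp, so a lookup for a single series does not need to touch the
rows of the other series.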

--
Jorge Godoy <jgodoy@gmail.com>


#4
04-09-2008, 11:04 PM
Pavel Stehule
Re: Design Question (Time Series Data)

2007/10/4, Jorge Godoy <jgodoy@gmail.com>:
> But then you'll make all SQL operations complex, and you will have problems
> using aggregates, etc. For example, with a normalized design one can query
> the average value of a specific series using simple commands, and given the
> use of indices this could be highly optimized. Now, using an array, he'll be
> doing a seqscan on every row because he needs to find out whether there was
> a value for the given series, then he'll need to extract those values and
> finally calculate the average (I know you can select an element of the
> array, but it won't be easy on the planner or the loop to calculate the
> average because they'll need to do all that on every row).
>
> I'd use the same solution that he was going to use: a normalized table
> including a timestamp (with TZ because of daylight saving time...), a column
> with a FK to a series table and the value itself. Index the first two
> columns (if you're searching using the value as a parameter, then index it
> as well) and this would be the basis of my design for this specific
> condition.
>
> Having good statistics and tuning autovacuum will also help a lot in
> handling new inserts and deletes.
>


It depends on the work. In some cases a normalised solution can be
better, in others not. But I believe that if you have a lot of time
series, then arrays are better. But I repeat, it depends on the task.

Pavel

#5
04-09-2008, 11:04 PM
Andreas Strasser
Re: Design Question (Time Series Data)

Pavel Stehule wrote:
> 2007/10/4, Jorge Godoy <jgodoy@gmail.com>:
>> I'd use the same solution that he was going to use: a normalized table
>> including a timestamp (with TZ because of daylight saving time...), a
>> column with a FK to a series table and the value itself. Index the first
>> two columns (if you're searching using the value as a parameter, then
>> index it as well) and this would be the basis of my design for this
>> specific condition.
>>
>> Having good statistics and tuning autovacuum will also help a lot in
>> handling new inserts and deletes.
>
> It depends on the work. In some cases a normalised solution can be
> better, in others not. But I believe that if you have a lot of time
> series, then arrays are better. But I repeat, it depends on the task.
>
> Pavel

Thanks for your input so far. Maybe I should add a few things about what
I will do with the data. There are only a few operations that will be
done in the database:

a) retrieving a slice or the whole series
b) changing the frequency of the series
c) grouping several series (with the same time frame/frequency) together
in a result set
d) calculating moving averages and other econometrics stuff :-)
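
As an illustration of operation (d), a trailing moving average over a
series that has already been retrieved could look like the following
Python sketch. The function name and the choice to return None until a
full window is available are assumptions, not something specified in this
thread.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def moving_average(values: Iterable[float], window: int) -> Iterator[Optional[float]]:
    """Yield the trailing mean over `window` observations.

    Positions with fewer than `window` observations so far yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf: deque = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value, evicted by the append below
        buf.append(v)
        total += v
        yield total / window if len(buf) == window else None


print(list(moving_average([1, 2, 3, 4, 5], 3)))
```

Keeping a running total means each new observation costs constant time
regardless of the window length.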

I will always know which series I want (i.e. there will be no case where
I'm searching for a value within the series).

Two questions regarding the arrays: Do you know if these are really
dynamic (e.g. if I have two rows, one with an array of 12 values and
the other with 1,000 values, will Postgres pad the shorter row?),
and is there a built-in function to retrieve arrays as rows? (I know
that you can build your own function for that, but I wonder whether
there is a faster native function.)

Thank you very much!

Andreas
#6
04-09-2008, 11:04 PM
Ted Byers
Re: Design Question (Time Series Data)


--- Andreas Strasser <kontakt@andreas-strasser.com> wrote:

> Hello,
>
> I'm currently designing an application that will retrieve economic data
> (mainly time series) from different sources and distribute it to clients.
> It is supposed to manage around 20,000 different series with differing
> numbers of observations (some have only a few dozen observations, others
> several thousand), and I'm now faced with the decision of where and how
> to store the data.


If you really have such a disparity among your series,
then it is a mistake to blend them into a single
table. You really need to spend more time analyzing
what the data means. If one data set is comprised of
the daily close price of a suite of stocks or mutual
funds, then it makes sense to include all such series
in a given table, but if some of the series are daily
close price and others are monthly averages, then it
is a mistake to combine them in a single table, and
two or more would be warranted. Or if the data are
from different data feed vendors, then you have to
think very carefully whether or not the data can
logically be combined.

>
> So far, I've come up with three possible solutions:
>
> 1) Storing the observations in one big table with fields for the series,
> position within the series and the value (float)
> 2) Storing the observations in an array (either in the same table as the
> series or in an extra data table)
> 3) Storing the observations in CSV files on the hard disk and only
> putting a reference to them in the database
>

I don't much like any of the above. When I have had
to process data for financial consultants, I applied a
few simple filters to ensure the data is clean (e.g.
tests to ensure data hasn't been corrupted during
transmission, proper handling of missing data, &c.),
and then bulk loaded the data into a suite of tables
designed specifically to match the vendor's
definitions of what the data means. Only then did we
apply specific analyses designed in consultation with
the financial consultant's specialists; folk best
qualified to help us understand how best to understand
the data and especially how it can be combined in a
meaningful way.

If the data are stored in a suite of well defined
tables, subsequent analyses are much more easily
designed, implemented and executed.

I do not know if PostgreSQL, or any other RDBMS,
includes the ability to call on software such as "R"
to do specific statistical analysis, but if I had to
do some time series analysis, I would do it in a
client application that retrieves the appropriate data
from the database and either does the analysis in
custom code I have written (usually in C++, as some of
my favourite analyses have not made it into commonly
available open source or commercial statistical
software) or invokes the appropriate functions from
statistical software I have at my disposal. The
strategy I describe above makes the SQL required for
much of this dirt simple.

HTH

Ted

#7
04-09-2008, 11:04 PM
Michael Glaesemann
Re: Design Question (Time Series Data)


On Oct 4, 2007, at 9:30 , Ted Byers wrote:

> I do not know if PostgreSQL, or any other RDBMS,
> includes the ability to call on software such as "R"


See PL/R:

http://www.joeconway.com/plr/

Michael Glaesemann
grzm seespotcode net



#8
04-09-2008, 11:04 PM
Andreas Strasser
Re: Design Question (Time Series Data)

Ted Byers wrote:

> If you really have such a disparity among your series,
> then it is a mistake to blend them into a single
> table. You really need to spend more time analyzing
> what the data means. If one data set is comprised of
> the daily close price of a suite of stocks or mutual
> funds, then it makes sense to include all such series
> in a given table, but if some of the series are daily
> close price and others are monthly averages, then it
> is a mistake to combine them in a single table, and
> two or more would be warranted. Or if the data are
> from different data feed vendors, then you have to
> think very carefully whether or not the data can
> logically be combined.


> I don't much like any of the above. When I have had
> to process data for financial consultants, I applied a
> few simple filters to ensure the data is clean (e.g.
> tests to ensure data hasn't been corrupted during
> transmission, proper handling of missing data, &c.),
> and then bulk loaded the data into a suite of tables
> designed specifically to match the vendor's
> definitions of what the data means. Only then did we
> apply specific analyses designed in consultation with
> the financial consultant's specialists; folk best
> qualified to help us understand how best to understand
> the data and especially how it can be combined in a
> meaningful way.


Thank you for your long and detailed answer. It would be possible for me
to group those series by frequency or vendor (or even by size), but this
would make SQL queries really complicated. Grouping the series by theme
would probably work but would be a real pain, and I doubt whether this
has a real advantage for my application.

The real analysis of the data will happen on the client side; the server
only has to ensure that the time series are up to date and not corrupted.
It doesn't care about the meaning of the value of any given field (I
was probably a little bit unclear about that). It will (with the help of
R, via PL/R) do a few simple calculations and transformations, but
those can be applied without much knowledge about the properties of the
data.

> I do not know if PostgreSQL, or any other RDBMS,
> includes the ability to call on software such as "R"
> to do specific statistical analysis, but if I had to
> do some time series analysis, I would do it in a
> client application that retrieves the appropriate data
> from the database and either does the analysis in
> custom code I have written (usually in C++, as some of
> my favourite analyses have not made it into commonly
> available open source or commercial statistical
> software) or invokes the appropriate functions from
> statistical software I have at my disposal. The
> strategy I describe above makes the SQL required for
> much of this dirt simple.


Exactly, I'm looking for a simple and clean way to store data (possibly
transforming its frequency) and serve it to the clients (in different
formats).

A typical request would be "Give me quarterly GDP and quarterly import
figures for the US from 1987-01 to 2007-2 plus the effective federal
funds rate (transformed to quarterly averages) for the same period".

The server therefore has to grab the values for the first two series
(both are much longer and go back to the 1930s or 40s) and extract the
values for the requested period (padding it if it's too short). It will
do the same for the third series and transform it to quarterly values
(it is only available for months/weeks). Finally it will return a table
with a date column and 3 columns for the requested data.
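
The request described above (slice, pad, convert monthly values to
quarterly averages, return one table) can be sketched in Python as
follows. The function names, the (year, month) and (year, quarter) keys
and the sample numbers are all hypothetical and only illustrate the flow;
a partial quarter is averaged over whatever months are present, which may
or may not be the behaviour wanted.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Quarter = Tuple[int, int]  # (year, quarter number 1..4)


def to_quarterly_mean(monthly: Dict[Tuple[int, int], float]) -> Dict[Quarter, float]:
    """Average (year, month) observations into (year, quarter) buckets."""
    buckets: Dict[Quarter, List[float]] = defaultdict(list)
    for (year, month), value in monthly.items():
        buckets[(year, (month - 1) // 3 + 1)].append(value)
    return {q: mean(vs) for q, vs in buckets.items()}


def quarters(start: Quarter, end: Quarter) -> List[Quarter]:
    """List every quarter from start to end, inclusive."""
    out = []
    year, q = start
    while (year, q) <= end:
        out.append((year, q))
        year, q = (year, q + 1) if q < 4 else (year + 1, 1)
    return out


def align(start: Quarter, end: Quarter, *series: Dict[Quarter, float]) -> List[tuple]:
    """Build rows of (quarter, v1, v2, ...), padding missing values with None."""
    return [(q, *(s.get(q) for s in series)) for q in quarters(start, end)]


# Made-up numbers, not real economic data.
gdp = {(2007, 1): 100.0, (2007, 2): 102.0}
rate_monthly = {(2007, 1): 1.0, (2007, 2): 2.0, (2007, 3): 3.0, (2007, 4): 4.0}

for row in align((2006, 4), (2007, 2), gdp, to_quarterly_mean(rate_monthly)):
    print(row)
```

The same steps could equally live in a stored procedure; the sketch only
shows that each series is reduced to a common quarterly key before the
columns are lined up.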

Most of the stuff will be done in stored procedures, since I don't want
to mess around with it in PHP (which acts as a proxy for the clients).

Andreas
#9
04-09-2008, 11:04 PM
Ted Byers
Re: Design Question (Time Series Data)

--- Michael Glaesemann <grzm@seespotcode.net> wrote:

>
> On Oct 4, 2007, at 9:30 , Ted Byers wrote:
>
> > I do not know if PostgreSQL, or any other RDBMS,
> > includes the ability to call on software such as "R"
>
> See PL/R:
>
> http://www.joeconway.com/plr/
>

Thanks. Good to know.

Ted


#10
04-09-2008, 11:05 PM
Josh Tolley
Re: Design Question (Time Series Data)

On 10/4/07, Ted Byers <r.ted.byers@rogers.com> wrote:
> --- Michael Glaesemann <grzm@seespotcode.net> wrote:
>
> >
> > On Oct 4, 2007, at 9:30 , Ted Byers wrote:
> >
> > > I do not know if PostgreSQL, or any other RDBMS,
> > > includes the ability to call on software such as "R"
> >
> > See PL/R:
> >
> > http://www.joeconway.com/plr/
> >

> Thanks. Good to know.


See also RdbiPgSQL
(http://bioconductor.org/packages/2.0...RdbiPgSQL.html)

PL/R lets you write R functions as procedural functions you can call
from pgsql (e.g. select my_r_function(myfield) from mytable). RdbiPgSQL
creates R functions you can use to query pgsql from within R.

-Josh/eggyknap
