This is a discussion on Representation for Heterogeneous Attribute Set within the SQL Server forums, part of the Microsoft SQL Server category; --> My company is working on a bond derivative portfolio analysis tool and we're facing a problem that I did ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| My company is working on a bond derivative portfolio analysis tool and we're facing a problem that I did not see adequately addressed any where in literature. I really did RTFM. I'm very experienced in relational modelling (10+ years) so this is not a case of not understanding the principles. Here is the problem stripped of irrelevant context. The problem below is simplified for the sake of the example so don't sweat the details. THE PROBLEM 1. There are many types of bonds, each type has a different set of attributes, different attribute names, different attribute datatypes. For example, bond A has two variables: a yearly interest rate and date of issue, B has five variables: an interest rate and 4 specific dates on which various portions of principal need to be paid, bond C has a set of 4 variables: interest rate in period 1, interest rate in period 2, the date on which the bond can be put back to the issuer, and two dates on which the bond can be called by the issue. And so on. So, on the first attempt I could represent each bond type as its own table. For example, create table bond_type_a (rate INTEGER, issue_date DATE) create table bond_type_b (rate INTEGER, principle_date1 DATE, principle_date2 DATE, principle_date3 DATE, principle_date4 DATE) create table bond_type_c (rate1 INTEGER, rate2 INTEGER, put_date DATE, call_date DATE) This is the nice relational approach but it does not work because: 2. There are many thousands of bond types thus we would have to have many thousands of tables which is bad. 3. The client needs to be able construct the bond types on the fly through the UI and add it to the system. Obviously, it would be bad if each new type of bond created in the UI resulted in a new table. 4. When a user loads the bond portfolio it needs to be very fast. In the table per type approach if a user has a 100 different types if bond in the portfolio you would have to do 100 joins. This is a heavily multi user environment so it's a non-starter. It's impossibly slow. THE SOLUTIONS So now that we ditched the table per bond type approach we can consider the followiing solutions (unpleasant from the relational point of view): 1. Name-Value pairs. create table bonds (bond_id INTEGER, bond_type INTEGER, attribute_id INTEGER, value VARCHAR(255)) Comment: The client does not like this approach because they want to run various kinds of reports and thus they doe not want the values to be stored as VARCHAR. They want the DB to enforce the datatype. 2. Typed Name-Value pairs. create table bonds (bond_id INTEGER, bond_type INTEGER, attribute_id INTEGER, int_val INTEGER, string_val VARCHAR(255), date_val DATE_ Comment: The client does not like this because the table is sparse. Every row has two empty fields. 3. Link table with table per data type. create table bonds (bond_id INTEGER) create table bond_int_data (bond_id INTEGER REFERENCES bonds(bond_id), value INTEGER) create table bond_string_data (bond_id INTEGER REFERENCES bonds(bond_id), value VARCHAR(255)) create table bond_date_data (bond_id INTEGER REFERENCES bonds(bond_id), value DATE) Comment: This meets most of the requirements but it just looks ugly. 4. Dynamic Mapping create table (bond_id INTEGER, int_val1 INTEGER, int_val2 INTEGER, date_val1 DATE, date_val2 DATE, string_val1 VARCHAR(255), string_val2 VARCHAR(255)) Then you have to add some dynamic mapping in your code which will provide bond specific mapping (say, stored in an XML file). For example, For bond_A: yearly_rate maps to int_val1, issue_date maps to date_val1 For bond_C: rate1 maps to int_val1, rate2 maps to int_val2, put_date maps to date_val1, call_date maps to date_val2) Comment: This is very good for performance because when I load a portfolio of different bond types I can pull them all in in one SELECT statement. However this approach has a problem that the table is sparse. The number of fields of each type has to be as high as to accmodate the most complex bond while simple bonds will only be using two or three. THE QUESTIONS: Are the four approaches I described above exhaustive? Are there any other that I overlooked? |
| |||
| <robertbrown1971@yahoo.com> wrote in message news:1108102001.898621.256240@l41g2000cwc.googlegr oups.com... > My company is working on a bond derivative portfolio analysis tool and > we're facing a problem that I did not see adequately addressed any > where in literature. I really did RTFM. I'm very experienced in > relational modelling (10+ years) so this is not a case of not > understanding the principles. Here is the problem stripped of > irrelevant context. The problem below is simplified for the sake of the > example so don't sweat the details. > [major snip] Robert, Normally, when I see a question like yours, I recommend doing a google search on "relational model generalization". However, I think, from the examples you provide, that you already know the things most people learn from such a search. You might want to do the search anyway, just to cross check your conclusions. What follows is, I am afraid, not terribly helpful. I write it in the hope that it can lead somewhere constructive. The nut of the problem, it seems to me, is the requirement that users can define new types of bonds on the fly. In classical relational design, a data analysis of the subject matter reveals all the "entities" and "relationships between entities" that are in the universe of discourse. From there, adding the attributes and domains that describe all of the relevant data values is straightforward, if tedious. The next step is to design a relational model. In this step a fairly limited, and very stable set of tuple types is derived. The tuple types are comprehensive, in the sense that they permit stating all the facts in the universe of discourse in terms of the data domains. You only need to add a new tuple type when the universe of discourse expands. The next step is to build a physical model, and build the database. When that's done, we generally find that adding a new instance of an existing tuple type can be done by inserting new data. Creating a new tuple type, on the other hand, requires altering the metadata. That's classical relational design and development. From your outline of the problem, I'm pretty sure you already know everything I've said up to this point, even if you would have phrased it differently. Now, here's the rub: Can you create a new bond type without creating a new tuple type? I think the answer is no. Can you create a new tuple type without altering metadata? I think the answer is no. Can you trust end users with the management of metadata? I think the answer is no. The usual solution, which you've called "dynamic" is to disguise metadata as data, by storing it in user tables. That may be the best you are going to do. But it still means that your users are managing metadata, whether they know it or not. This is not the answer, but it's worth giving some thought. |
| |||
| <robertbrown1971@yahoo.com> wrote in message news:1108102001.898621.256240@l41g2000cwc.googlegr oups.com... <snip> > THE QUESTIONS: > > Are the four approaches I described above exhaustive? Are there any > other that I overlooked? > I hesitate to mention this, but in case you are open to using data models other than relational, this is a perfect application for tools that permit weak typing, among other features. If you were to model this application in XML documents, for example, ignoring dtds and xsds for now, you will see that the pattern to the data that makes sense in your mind and the minds of the users can also make sense to a database product. It simply isn't a good match for today's RDBMS implementations, nor for a strongly typed persistence engine. --dawn |
| |||
| Do the auditors and the feds know that anyone can invent a new kind of security in your system? I doubt it. The classic scenario calls for a root class with all the common attributes and then specialized sub-classes under it. As an example, let's take the class of Vehicles and find an industry standard identifier (VIN), and add two mutually exclusive sub-classes, Sport utility vehicles and sedans ('SUV', 'SED'). CREATE TABLE Vehicles (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) NOT NULL CHECK(vehicle_type IN ('SUV', 'SED')), UNIQUE (vin, vehicle_type), ..); Notice the overlapping candidate keys. I then use a compound candidate key (vin, vehicle_type) and a constraint in each sub-class table to assure that the vehicle_type is locked and agrees with the Vehicles table. Add some DRI actions and you are done: CREATE TABLE SUV (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) DEFAULT 'SUV' NOT NULL CHECK(vehicle_type = 'SUV'), UNIQUE (vin, vehicle_type), FOREIGN KEY (vin, vehicle_type) REFERENCES Vehicles(vin, vehicle_type) ON UPDATE CASCADE ON DELETE CASCADE, ..); CREATE TABLE Sedans (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL CHECK(vehicle_type = 'SED'), UNIQUE (vin, vehicle_type), FOREIGN KEY (vin, vehicle_type) REFERENCES Vehicles(vin, vehicle_type) ON UPDATE CASCADE ON DELETE CASCADE, ..); I can continue to build a hierarchy like this. For example, if I had a Sedans table that broke down into two-door and four-door sedans, I could a schema like this: CREATE TABLE Sedans (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL CHECK(vehicle_type IN ('2DR', '4DR', 'SED')), UNIQUE (vin, vehicle_type), FOREIGN KEY (vin, vehicle_type) REFERENCES Vehicles(vin, vehicle_type) ON UPDATE CASCADE ON DELETE CASCADE, ..); CREATE TABLE TwoDoor (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) DEFAULT '2DR' NOT NULL CHECK(vehicle_type = '2DR'), UNIQUE (vin, vehicle_type), FOREIGN KEY (vin, vehicle_type) REFERENCES Sedans(vin, vehicle_type) ON UPDATE CASCADE ON DELETE CASCADE, ..); CREATE TABLE FourDoor (vin CHAR(17) NOT NULL PRIMARY KEY, vehicle_type CHAR(3) DEFAULT '4DR' NOT NULL CHECK(vehicle_type = '4DR'), UNIQUE (vin, vehicle_type), FOREIGN KEY (vin, vehicle_type) REFERENCES Sedans (vin, vehicle_type) ON UPDATE CASCADE ON DELETE CASCADE, ..); The idea is to build a chain of identifiers and types in a UNIQUE() constraint that go up the tree when you use a REFERENCES constraint. Obviously, you can do variants of this trick to get different class structures. If an entity doesn't have to be exclusively one subtype, you play with the root of the class hierarchy: CREATE TABLE Vehicles (vin CHAR(17) NOT NULL, vehicle_type CHAR(3) NOT NULL CHECK(vehicle_type IN ('SUV', 'SED')), PRIMARY KEY (vin, vehicle_type), ..); Now start hiding all this stuff in VIEWs immediately and add an INSTEAD OF trigger to those VIEWs. |
| |||
| Joe Celko, I actually bought and studied back to back your book "Trees and Hierarchies in SQL" in search of the solution to this problem. The book is great and is the only one on the market (that I could find) that deals head-on with the common problem of hierarchy represnetation in SQL. I learned about the nested set model, path enumeration, id-parentid etc. It has earned a permanent place on my reference book shelf but unfortunately it does not address the problem I outlined in this post. I'm not being faceteous here - I really did like the book a lot. I am also well familiar with the post you provided in reponse to my question. I studied it in the archives and I understand it. In fact, I used the approach you outlined elsewhere several times. I also own your "SQL for smarties book" which is also very good. Now to address your specific questions: -CELKO- wrote: > Do the auditors and the feds know that anyone can invent a new kind of > security in your system? I doubt it. Robert replies: This is not a trading system. It is an ANALYSIS tool where the analyst can invent new bonds and other instruments and see how they behave and how they affect the portfolio behaviour. A common way to analyze hedges for a portfolio is to try out different possibilities to see which ones work in a computer simulation. Then once you find the right one - then you see if there are existing combinations of securities (bonds, call/put options, floating bonds, commodity linked bonds etc) that can approximate the behaviour of the ideal hedge you modelled. -CELKO- wrote: > The classic scenario calls for a root class with all the common > attributes and then specialized sub-classes under it. As an example, > let's take the class of Vehicles and find an industry standard > identifier (VIN), and add two mutually exclusive sub-classes, Sport > utility vehicles and sedans ('SUV', 'SED'). Robert repies: 1) This is the generalization hierarchy approach. It does not work in this case since the subtypes are quite numerous and can be created dynamically. Your beef was primarily with dynamic creation of entities. That's actually not the only issue. Suppose we ditch the dynamic creation you still have the problem that you have thousands of tables, one for each bond type. If your portfolio has 800 bonds of 100 different types then you need to do a 100 joins when the user pulls up the portfolio. The performance of that is far worse than that of the aesthetically ugly name-value approach. 2) The generalization hiearachy makes sense *only* when many attributes are shared and only some attributes are specific to the subtype. That is not the case here there is very little overlap in attributes among the different bond. The only attribute they all have in commong is perhaps maturity but even here there are exceptions. For example one might think all debt instruments would have an interest rate. Not the case, there are zero coupon bonds. You might think every debt instrument has a maturity. Not the case, the maturity can be conidtional. And so on. 3) When wrote the post I suspected, as Dawn Walthius wrote, that this problem is simply not amenable to relational modelling. As several posters noted this use case can be very easily represented in an OO system or in an hierarchical persistence such as XML. I do not see a good solution for this in relational terms. It's a choice among imperfect alternatives. > CREATE TABLE Vehicles > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) NOT NULL > CHECK(vehicle_type IN ('SUV', 'SED')), > UNIQUE (vin, vehicle_type), > ..); > > Notice the overlapping candidate keys. I then use a compound candidate > key (vin, vehicle_type) and a constraint in each sub-class table to > assure that the vehicle_type is locked and agrees with the Vehicles > table. Add some DRI actions and you are done: > > CREATE TABLE SUV > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) DEFAULT 'SUV' NOT NULL > CHECK(vehicle_type = 'SUV'), > UNIQUE (vin, vehicle_type), > FOREIGN KEY (vin, vehicle_type) > REFERENCES Vehicles(vin, vehicle_type) > ON UPDATE CASCADE > ON DELETE CASCADE, > ..); > > CREATE TABLE Sedans > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL > CHECK(vehicle_type = 'SED'), > UNIQUE (vin, vehicle_type), > FOREIGN KEY (vin, vehicle_type) > REFERENCES Vehicles(vin, vehicle_type) > ON UPDATE CASCADE > ON DELETE CASCADE, > ..); > > I can continue to build a hierarchy like this. For example, if I had a > Sedans table that broke down into two-door and four-door sedans, I > could a schema like this: > > CREATE TABLE Sedans > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL > CHECK(vehicle_type IN ('2DR', '4DR', 'SED')), > UNIQUE (vin, vehicle_type), > FOREIGN KEY (vin, vehicle_type) > REFERENCES Vehicles(vin, vehicle_type) > ON UPDATE CASCADE > ON DELETE CASCADE, > ..); > > CREATE TABLE TwoDoor > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) DEFAULT '2DR' NOT NULL > CHECK(vehicle_type = '2DR'), > UNIQUE (vin, vehicle_type), > FOREIGN KEY (vin, vehicle_type) > REFERENCES Sedans(vin, vehicle_type) > ON UPDATE CASCADE > ON DELETE CASCADE, > ..); > > CREATE TABLE FourDoor > (vin CHAR(17) NOT NULL PRIMARY KEY, > vehicle_type CHAR(3) DEFAULT '4DR' NOT NULL > CHECK(vehicle_type = '4DR'), > UNIQUE (vin, vehicle_type), > FOREIGN KEY (vin, vehicle_type) > REFERENCES Sedans (vin, vehicle_type) > ON UPDATE CASCADE > ON DELETE CASCADE, > ..); > > The idea is to build a chain of identifiers and types in a UNIQUE() > constraint that go up the tree when you use a REFERENCES constraint. > Obviously, you can do variants of this trick to get different class > structures. > > If an entity doesn't have to be exclusively one subtype, you play with > the root of the class hierarchy: > > CREATE TABLE Vehicles > (vin CHAR(17) NOT NULL, > vehicle_type CHAR(3) NOT NULL > CHECK(vehicle_type IN ('SUV', 'SED')), > PRIMARY KEY (vin, vehicle_type), > ..); > > Now start hiding all this stuff in VIEWs immediately and add an INSTEAD > OF trigger to those VIEWs. |
| |||
| "David Cressey" <david.cressey@earthlink.net> wrote in message news:7J2Pd.9180$oO.2621@newsread2.news.atl.earthli nk.net... > Can you trust end users with the management of metadata? I think the answer > is no. To be continued? Can the end user to be trusted entering primary key attribute? Can the end user be trusted remembering the password? Can concurrency control rely on end users being able to commit or rollback their transactions in time? |
| |||
| "David Cressey" <david.cressey@earthlink.net> wrote in message news:7J2Pd.9180$oO.2621@newsread2.news.atl.earthli nk.net... > Now, here's the rub: Can you create a new bond type without creating a new > tuple type? I think the answer is no. > Can you create a new tuple type without altering metadata? I think the > answer is no. > Can you trust end users with the management of metadata? I think the answer > is no. > > The usual solution, which you've called "dynamic" is to disguise metadata as > data, by storing it in user tables. > That may be the best you are going to do. But it still means that your > users are managing metadata, whether they know it or not. Seriously, why end-users managing metadata is a bad idea? Is security compromised? Can users wreck a havoc on the database schema? Even with properly designed roles and privileges? Also, isn't the line between data and metadata somewhat artificial? |
| |||
| > Are there any other that I overlooked? RM has a scope which works best with data where groups of things have similar attributes. When modelling data that is highly variable, one method is generic modelling (ie everything in a few tables, possibly even one) however it quickly becomes impractical due to the impact on performance and diminished SQL support. There is no practical solution for this type of problem using RM. The core problem is a limitation of the RM data model and to a lesser degree limited processing power. The script shown below for an experimental db models a few bonds. The experimental db allows each thing to have a variable number of classifications, attributes, values and yet things are automatically normalized by the db. Another such example posted at www.xdb2.com/example/ex108.asp // Create items in directory to classify things (CREATE *bond.item ~in = dir) (CREATE *yearly rate.item ~in = dir) (CREATE *amount.item ~in = dir) (CREATE *principle portion payment.item ~in = dir) (CREATE *period.item ~in = dir) (CREATE *date.item ~in = dir) (CREATE *due date.item ~in = dir) (CREATE *issue date.item ~in = dir) (CREATE *start date.item ~in = dir) (CREATE *end date.item ~in = dir) (CREATE *put back date.item ~in = dir) (CREATE *call by issue date.item ~in = dir) // Create bond A with a yearly rate and an issue date (CREATE *bondA.cls = bond) (CREATE bondA.yearly rate = +5%) (CREATE bondA.issue date = (CREATE *1/1/05.cls = date _ & it.cls = issue date)) // Create bond B with a yearly rate // and 2 dates on which a portion of principal is due (CREATE *bondB.cls = bond) (CREATE bondB.yearly rate= +6%) (CREATE bondB.principle portion payment = _ (CREATE *.cls = principle portion payment _ & it.amount = +$345 _ & it.due date = (CREATE *1/1/11.cls = date _ & it.cls = due date))) (CREATE bondB.principle portion payment = _ (CREATE *.cls = principle portion payment _ & it.amount = +$456 _ & it.due date = (CREATE *1/1/12.cls = date _ & it.cls = due date))) // Create bond C with different rates in 2 periods, // one put back date and two call by issue dates. (CREATE *bondC.cls = bond) (CREATE bondC.period = _ (CREATE *.cls = period _ & it.yearly rate = +8% _ & it.start date = (CREATE *1/1/05.cls = date _ & it.cls = start date) _ & it.end date = (CREATE *1/31/05.cls = date _ & it.cls = end date))) (CREATE bondC.period = _ (CREATE *.cls = period _ & it.yearly rate = +9% _ & it.start date = (CREATE *2/1/05.cls = date _ & it.cls = start date) _ & it.end date = (CREATE *2/28/05.cls = date _ & it.cls = end date))) (CREATE bondC.put back date = (CREATE *1/1/25.cls = date _ & it.cls = put back date)) (CREATE bondC.call by issue date = _ (CREATE *7/1/25.cls = date _ & it.cls = call by issue date)) (CREATE bondC.call by issue date = _ (CREATE *9/1/25.cls = date _ & it.cls = call by issue date)) // Find bonds that has a period that earns 9% yearly // and has a call by issue data of 9/1/25. // Finds bond C. (SELECT *.cls = bond _ & *.period = (*.yearly rate = 9%) _ & *.call by issue date = 9/1/25) |
| |||
| > 1. Name-Value pairs. > > create table bonds (bond_id INTEGER, bond_type INTEGER, attribute_id > INTEGER, value VARCHAR(255)) > > Comment: The client does not like this approach because they want to > run various kinds of reports and thus they doe not want the values to > be stored as VARCHAR. They want the DB to enforce the datatype. One possible approach is to use a Variant datatype. In MSSQL, the "sql_variant" datatype can hold any non-BLOBs. Then add a DATA_TYPE column where: null is date, 0 is varchar, and 1 is integer. Use a trigger to ensure that the VALUE can be explicitly casted to the specified data type. CREATE TABLE bonds (bond_id INTEGER, bond_type INTEGER, attribute_id INTEGER, data_type BIT, value SQL_VARIANT) |
| ||||
| I ran into a similar problem with a Rent and Royalties business domain about 20 years ago. I found that instead of the typical attribute modeling approach, such as in Joe Celko's example, define the appropriate specialized sub-classes based on the algorithms. Here is a simple example with Loans where the Interest Calculation method changes over time. Create table Loans ( Loan_PrimaryKey , StartDate, TermYears , ....) -- For the Mortgage Interest Algorithms where the InterestMethod is either Fixed or Variable at any point in time Loan_Interest ( Loan_PrimaryKey , StartDate , InterestMethod ) -- Here is the sub-type for the Fixed algorithm: Loan_Interest_Fixed ( Loan_PrimaryKey , StartDate , InterestRate ) -- Here is the sub-type for the Variable algorithm: Loan_Interest_Variable ( Loan_PrimaryKey , StartDate, InterestMarginRate, InterestIndex_PrimaryKey ) Similary, there would be algorithm tables for each of the other Loan algorithm factors such as for Balloon Payments, Payment Frequency, Insurance, etc. Each of these could have their sub-classes which could have sub-classes. These same structures could be used for regardless if a Loan is for a Home Mortgages, Home Equity or Line of Credit. Hope this Helps. |