Avoiding The Query From Hell

Filed under: General — admin at 9:29 pm on Saturday, August 16, 2008

The “query from hell” is the DBA’s nightmare. It is a query that occupies the entire resource of a machine and effectively runs for ever, never finishing in any time-scale that is meaningful. Naturally, it is a query to be avoided. Given the vast quantity of data in a data warehouse, such queries are all too possible to generate. The problem is that it may be impossible to tell the difference between an acceptable query that takes up to 48 hours to run and one that will run for much longer and effectively never finishes. Unless it is possible to measure exactly how far a query has progressed and how much more processing it has left to do, it will be impossible to predict when a query will end. The only way of preventing such queries is by controlling exactly how much resource a query can have.

This can be done in two ways. First, profiles can be used to limit the amount of resource a user process can use. When that amount is exceeded, the query will be terminated automatically. This will not prevent a query running, but it will stop it running for ever using vast amounts of resources. The advantages of profiles are that they are automatic, and they require no external control. The disadvantage is that they kick in only after a lot of resource has already been wasted. Second, queuing of queries can be used to control their resource usage. If all . queries must be submitted via the query manager, then the degree of parallelism and other resources of the query can be controlled. Any query that is likely to run for extended periods of time can be limited in what resource it will use. This will mean that the query will take longer to run, but it will stop a single query interfering with the other processing necessary to maintain and run the data warehouse.

Experimenting using Prototypes

Filed under: General — admin at 12:03 am on Friday, August 15, 2008

Typically, organizations will experiment with the concept of data analysis and educate themselves on the value of a data warehouse, prior to determining that a data warehouse is the appropriate solution. In practice, many organizations address this by sponsoring an initial prototyping activity, which is used to further understanding of the feasibility and benefits of a data warehouse. This activity is valuable, and should be considered if this is the organization’s first exposure to the benefits of decision support information.

In some instances, the data warehouse may be the first large-scale client-server solution being implemented within the organization, and will require new skills, experiences, hardware, etc. A prototyping activity on a small scale can further the educational processes as long as the prototype addresses a clearly defined technical objective; the prototype can be thrown away once the feasibility of the concept has been shown - that is, it does not become “the first bit of the data warehouse”; the activity addresses a small subset of the eventual data content of the data warehouse and the activity timescale is non-critical that is, it is seen as a time boxed effort to come to grips with the new technologies being considered.

Unlike prototypes, working models suffer in that they have a tendency to set the expectation that they will grow to the full data warehouse. This may be inappropriate in practice, because the architecture of the model may not scale up to a full data warehouse. If the requirement is to produce an early release of part of a data warehouse, in order to deliver business benefits, we suggest that you focus on the business requirements and technical blueprint phases and understand the short and medium term requirements of the data warehouse.

Understanding the Case for Investment

Filed under: General — admin at 11:59 pm on Monday, August 4, 2008

In a data warehousing project, it is critical that the warehouse manager understands the business case for investment. This is necessary to understand the level of investment that can be justified, as hard choices will have to be made regarding the scope and size of the solution. The purpose of the business case is to identify the projected business benefits that should be derived from using the data warehouse. These benefits mayor may not be quantifiable (for example, a $5000000 saving per annum), but the projected benefits should be clearly stated. Data warehouses that do not have a clear business case tend to suffer from credibility problems at some stage during the delivery process.

Many organizations justify building data warehouses as “an act of faith.” Without exception, these projects have foundered or have been subject to very searching questions within six months of project inception.

Data Warehouse Delivery Method

Filed under: General — admin at 11:52 pm on Friday, August 1, 2008

The delivery method is essentially a variant of the joint application development approach, adapted for delivery of data warehouses. The entire delivery process is staged in order to minimize risk, by producing production-quality deliverables, which are designed to grow to a full solution. Our experience has been that most enterprise data warehousing projects tend to have a development cycle of between 18 and 24 months from start to end. This is clearly difficult to justify; because the required investment is substantial, and no business benefits can be generated until the system is delivered. Clearly, an approach that allows the business to deliver “quick wins” is more attractive.

The approach discussed within this section will not reduce the overall delivery time-scales; however, it does ensure that business benefits are delivered incrementally through the development process. This means that the cost of investment can be offset and measured against the benefits being generated by the data warehouse. If the projected benefits are not being achieved in practice, further investment can be halted, or the project can be redirected to a different business need, minimizing the overall project risk. Before we discuss the delivery process, bear in mind that this method is designed to deliver an enterprise data warehouse, not a point solution. Enterprise data warehouses provide information about many (if not all) aspects of information being used by the business. They are not small databases (that is, 50 GB) or data marts being used to address a specific functional need. Enterprise data warehouses provide an accurate and consistent view of enterprise information, regardless of which point solutions they may be addressing at a specific point in time.

Designing To The Query Requirements

Filed under: General — admin at 9:24 pm on Tuesday, June 24, 2008

To address the challenge of designing the data warehouse to be query efficient even for ad hoc access, techniques such as star schemas, denormalization, query parallelism and data partitioning are used to make the data warehouse as query efficient as possible. The aim is to design and layout the fact data to make it as accessible and query efficient as it can be. The partitioning, while used primarily for ease of management, should also be designed for performance, with data partitioned in business meaningful chunks. Data is denormalized to avoid costly joins between huge tables. The fact data and its dimension data are organized into some variant of a star schema to allow star query access between the dimensions and the fact table. Finally, queries are parallelized to allow them to process such vast quantities of data in a realistic time-scale.

The data layout is then designed to take all this into account. The fact and dimension data are placed on disk to minimize the possibilities of any I/O bottlenecks and to maximize parallel access to the fact data. The irony is that, having gone to such great lengths to make the fact data maximally accessible, the next step is to ensure that the users never access it. You should aim to have 90% or more of all queries answered by access to aggregations. A query that runs against an aggregation will avoid the sort and aggregation work that it would have to do if it was run against the fact data. As sort is one of the most costly operations, and one that does not scale linearly as the data grows, the savings made by avoiding sorts are significant.

Scheduling of Reports

Filed under: General — admin at 9:22 pm on Saturday, June 21, 2008

With proper scheduling these reports can be run extremely quickly using high degrees of parallelism. This allows the full power of the machine to be brought to bear on these reports, and the batch reports can be finished quickly. The canned queries are also predefined queries, but they differ from reports in that there is a requirement to run them online. Canned queries also differ from reports in that they are often parameterized, and hence the data set they visit can vary radically in size. As with the batch reports these queries are a good starting point for testing and development. You can measure the resource requirements of these queries, and the results can be used for capacity planning and for database design.

The canned queries are again good candidates for use on a pilot system, or for early delivery in a phased development. Ad hoc queries, as the name suggests, are the unpredictable element of a data warehouse. They are also generally the main reason for developing the data warehouse in the first place. It is exactly that ability to run any query when desired and expect a reasonable response that makes the data warehouse worthwhile, and makes the design such a significant challenge. The ad hoc query profile will be difficult if not impossible to predict. The best that can be done is to develop an understanding of the queries that are likely to be run. This will come from an understanding of the business and from the requirements capture. This information can then be used in the design of the database to meet those requirements.

Detailed Analysis or Simple Analysis?

Filed under: General — admin at 9:21 pm on Saturday, June 14, 2008

It is vital that you get this right, because the decision on what constitutes the fact data for the data warehouse will depend on these answers. This is one of the most important decisions you will make in the whole project. If you get the level of detail wrong for the fact data you may ultimately have to scrap the whole data warehouse and start again. Just to reinforce this statement, and to ensure that it is not taken as a throwaway comment, if you get the level of detail wrong for the fact data you will probably have to scrap the whole data warehouse and start again. The whole design, the sizing, the capacity planning and so on will be based on this decision. If the level of detail is incorrect, the hardware will be the wrong size, the database design and layout will be incorrect, and any partitioning you are using will be wrong. You will need to start over.

If you are aware that, as in the example above, the level of detail will increase later, you can design with that in mind, and ensure that any design you put in place now should not hamper that change later. Having an understanding of these details will allow you to round out the query requirements. It allows the user requirements to be rationalized, and a reasonable estimate of the sorts of queries that are possible in the future to be calculated.

Importance of capturing right Information

Filed under: General — admin at 9:19 pm on Monday, June 9, 2008

If you are a data warehouse manager, remember to capture all the time, period and business date information by department, as different parts of a business may have different dates and times. This information is crucial to the data warehouse design, because it can for example radically affect your weekend processing if some parts of the business have their week begin on Sunday and others have their week start on Monday.

Part of analyzing and understanding the business is getting to grips with the key business indicators, and the data dimensions that are important. What dimensions or data fields will the users want to summarize and query on? These fields will form the basis of all your aggregations, and this information - in conjunction with the date and period information discussed above-will allow you to define the base set of aggregations that will be required.

As part of the analysis you also need to drive out the meaningful levels of data detail. For example, if the business is a food retailer, does it need tb keep detailed records down to the sale of each individual tin of baked beans, or is it sufficient to. summarize sales of baked beans by tin size by brand by day?

Key Periods

Filed under: General — admin at 9:18 pm on Thursday, May 29, 2008

The key periods are the important reporting and financial periods. Does the business have special processing each week, each month, each quarter and so on? Some businesses are not interested in information in the short term, such as daily, and will tend to analyze data over weekly or larger periods. This may be because a given day’s business takes several days to process through the operational systems, with the data becoming available to the data warehouse over the period of a week or longer. It could be that the business does not do daily analysis because it never had the processing power to do it. You need to understand the reasons and the requirements. Just because they do not analyze daily data now, possibly because they have not been able to, it does not mean they will never want to analyze data on a daily basis.

Key business dates need to be documented. The start of the company’s business year, the start of the local tax year and so on are key dates, and will have effects on the processing load on the data warehouse. Some companies have standard business holidays when the business closes down for a number of days or weeks.

Analyzing The User Query Requirements

Filed under: General — admin at 2:59 pm on Saturday, May 24, 2008

One of the key steps in the analysis stage is ascertaining the user query requirements. Before the data warehouse layout can be designed you need a clear understanding of the queries that are likely to be run. Although there will probably be some canned queries or standard reports that can be identified, it is unlikely that the users will have a clear definition of everything they want.
Even if the users are unusually forthcoming, and can give chapter and verse on their requirements, it is unlikely that this will be the whole story. There may be other users who are to be added later, and there is no guarantee that their queries will be the same as those of the current users.

As the users get used to the data warehouse and its abilities, they may begin to explore data in different ways. Remember, it is the ad hoc nature of a data warehouse that makes it what it is. As the users’ requirements change over time, the data warehouse will evolve, but the database design needs to remain constant. So, when performing the analysis requirements capture, it is vital to gain an understanding of the business, and the business sector it occupies. This is the one constant: even if the business changes, it is unlikely that It will change so radically that it moves business sector. A telco is not likely to become a bank, and an airline is unlikely to become a retail chain. In the unlikely event that such a change does occur, the data warehouse will probably need to be designed again from scratch in any case.