0

We are designing a reporting solution for survey results. Although datasets are reasonable in size (rarely more than 500.000 respondents and 50 questions), performance is obviously a major concern.

Due to the nature of the solution, most queries return aggregated values and no locks are needed.

Storing answers in a "normal" tabular format (i.e. a column for each question and a row for each respondent) works well in terms of performance, and allows us to query the data like so:

SELECT COUNT(*) FROM Answers WHERE Gender = 'M' and Age < 20 

However, this design requires a new table for each survey as the questions (columns) differ, which is obviously not an ideal solution.

Therefore, we are considering a design where we store answer data in a table that would basically just hold a respondent ID, a question ID and an answer value, thereby "transposing" the data (i.e. there would be a row for each respondent/question combination in the Answers table).

In this design, we would have to use exists conditions (or joins) to filter our data, e.g:

SELECT COUNT(*) 
FROM Answers AS A1 
WHERE A1.QuestionID = 'Gender' AND A1.VALUE = 'M' AND EXISTS
( 
    SELECT * 
    FROM Answers AS A2 
    WHERE A2.RespondentID = A1.RespondentID AND A2.QuestionID = 'Age' AND A2.Value > 18
)

This would allow as to handle any survey without changing the database schema but we are concerned about what the impact might be on performance?

Or perhaps there is a better way to deal with this issue altogether?

Rasmus B
  • 3
  • 2
  • It is true that the one row per respondent and question is very flexible, but it will increase the complexity of processing and maintaining any needed constraints. Another question: Is every column of the same type of data (e.g Yes/No, A/B/C/D, freetext, or...) or can the answers be of all types? – RLF Aug 28 '13 at 15:26
  • Do you report on more than one survey? For example, when you ask WHERE Gender = 'M' and Age < 20, are you wanting the count of all surveys or just one particular survey? If it is only one particular survey then one table per survey is the appropriate answer. – Greenstone Walker Aug 29 '13 at 00:05
  • We do not combine results from multiple surveys in our reports, so the "one table per survey" design is feasible. – Rasmus B Aug 29 '13 at 07:29
  • Types of answer data vary. Most are small integers (say 1-5) while some are variable length text (rarely more than a few 100 characters, but we need to support a few 1000 at least). We do not need to maitain any constraints as the dataset is never updated (we are simply importing an SPSS file with "static" survey data). – Rasmus B Aug 29 '13 at 07:33
  • Your proposed design sounds like an Entity-Attribute-Value schema. Check this for opinions on how this works: http://dba.stackexchange.com/a/20763/10832 – Hannah Vernon Sep 26 '13 at 04:17

1 Answers1

0

We did some performance tests and found that - with scenarios that are typical for our application domain - it is 10-20 times faster to query data when stored in a regular "tabular" format.

Rasmus B
  • 3
  • 2