http://bugs.winehq.org/show_bug.cgi?id=12071
Summary: MSI SQL joins on tables with many rows are extremely slow Product: Wine Version: 0.9.57. Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: msi AssignedTo: truiken@gmail.com ReportedBy: truiken@gmail.com CC: ead1234@hotmail.com
With the current implementation of MSI SQL joins, we perform a Cartesian product on the tables joined together. In the case where there is no WHERE clause, this is the correct output, but if there is a WHERE clause, we still perform the Cartesian product and then filter on the WHERE clause. Though it's uncommon, there are a few installers (Visual Studio, Nero Express 7) that join tables with thousands of rows. For example, say we have tables A, B, and C s.t.
colA colB colC ----- ----- ----- 1 1 1 2 2 2 ... ... ... 1000 1000 1000
and we have the query
SELECT * FROM A, B, C WHERE `colA`=1 AND `colB`=1 AND `colC`=1
then the current implementation will create a new table with those columns containing 1000*1000*1000=1 billion rows. Then we check each of the 1 billion rows for any matches. There are several algorithms for optimizing joins:
http://en.wikipedia.org/wiki/Join_(SQL)#Join_algorithms
The solution I'll be working on is the merge join. This solution parses the WHERE clause, starting with the two tables that the columns of the clause belong to (colA -> A, colB -> B, etc) and joins those together, making sure to eliminate rows based on the WHERE clause. Then the next table in the clause is merged into the previously created table, eliminating rows. This continues until there are no tables left. One optimization of this solution is to start with parts of the WHERE clause that compare against literals. For example, the query:
SELECT * FROM A, B, C WHERE `colA` = `colB` AND `colB` = 1 AND `colC` = 1
We'd start with "`colB` = 1" since the comparison is against a literal. We perform 1000 comparisons on the rows of table B (count = 1000). The resulting table is:
colB ---- 1
then we do the same for table C (count = 1000), and the merged table is:
colB colC ---- ---- 1 1
next we merge this table with table A:
colA colB colC ---- ---- ---- 1 1 1 2 1 1 ... ... ... 1000 1 1
and we search through this table for the condition `colA` = `colB` (count = 1000). So we've reduced billions of comparisons to 3000.
Unfortunately, the way our SQL engine is implemented, this will not be an easy task.