
Background

I would like to provide the subset of my database required to reproduce a select query. My goal is to make my computational workflow reproducible (as in reproducible research).

Question

Is there a way to incorporate this select statement into a script that dumps the queried data into a new database, such that the database could be installed on a new MySQL server and the statement would work against the new database? The new database should not contain any records beyond those used in the query.

Update: For clarification, I am not interested in a CSV dump of the query results. What I need is to dump the database subset so that it can be installed on another machine, where the query itself can then be reproduced (and modified with respect to the same dataset).

Example

For example, my analysis might query a subset of data that requires records from multiple (in this example 3) tables:

select table1.id, table1.level, table2.name, table2.level 
       from table1 join table2 on table1.id = table2.table1_id 
       join table3 on table3.id = table2.table3_id
       where table3.name in ('fee', 'fi', 'fo', 'fum'); 
David LeBauer

8 Answers

67

mysqldump has the --where option to apply a WHERE clause when dumping a given table.

Although it is not possible to mysqldump a join query, you can export specific rows from each table so that every row fetched from each table will be involved in the join later on.

For your given query, you would need to mysqldump three times:

First, mysqldump all table3 rows with name in ('fee','fi','fo','fum'):

mysqldump -u... -p... --where="name in ('fee','fi','fo','fum')" mydb table3 > table3.sql

Next, mysqldump all table2 rows that have matching table3_id values from the first mysqldump:

mysqldump -u... -p... --lock-all-tables --where="table3_id in (select id from table3 where name in ('fee','fi','fo','fum'))" mydb table2 > table2.sql

Then, mysqldump all table1 rows that have matching table1_id values from the second mysqldump:

mysqldump -u... -p... --lock-all-tables --where="id in (select table1_id from table2 where table3_id in (select id from table3 where name in ('fee','fi','fo','fum')))" mydb table1 > table1.sql

Note: Since the second and third mysqldumps require using more than one table, --lock-all-tables must be used.

Create your new database:

mysqladmin -u... -p... create newdb

Finally, load the three dumps into the new database:

mysql -u... -p... -D newdb < table1.sql
mysql -u... -p... -D newdb < table2.sql
mysql -u... -p... -D newdb < table3.sql

Then, in the mysql client, run your join query:

mysql> use newdb
mysql> select table1.id, table1.level, table2.name, table2.level 
       from table1 join table2 on table1.id = table2.table1_id 
       join table3 on table3.id = table2.table3_id
       where table3.name in ('fee', 'fi', 'fo', 'fum'); 

Give it a Try !!!
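Since the question asks for a single reproducible script, here is one way to chain all of the steps together; a minimal sketch, where USER and PASS are placeholder shell variables and mydb/newdb are the names used above:

#!/bin/sh
# Sketch: chain the three --where dumps, create the new database, and reload.
FILTER="name in ('fee','fi','fo','fum')"

mysqldump -u"$USER" -p"$PASS" --where="$FILTER" mydb table3 > table3.sql
mysqldump -u"$USER" -p"$PASS" --lock-all-tables --where="table3_id in (select id from table3 where $FILTER)" mydb table2 > table2.sql
mysqldump -u"$USER" -p"$PASS" --lock-all-tables --where="id in (select table1_id from table2 where table3_id in (select id from table3 where $FILTER))" mydb table1 > table1.sql

mysqladmin -u"$USER" -p"$PASS" create newdb
for t in table1 table2 table3; do
  mysql -u"$USER" -p"$PASS" -D newdb < "$t.sql"
done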

WARNING: If the tables are not indexed correctly, the second and third mysqldumps may take forever !!!

Just in case, index the following columns:

ALTER TABLE table2 ADD INDEX (table1_id);
ALTER TABLE table2 ADD INDEX (table3_id);
ALTER TABLE table3 ADD INDEX (name,id);

I'll assume id is the primary key of table3.
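To verify that assumption before dumping, you can inspect the index metadata:

SHOW INDEX FROM table3 WHERE Key_name = 'PRIMARY';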

RolandoMySQLDBA
  • thanks for the detailed example! I missed the --where clause in the documentation; will let you know how this works after I get a chance to try it out. – David LeBauer Aug 15 '11 at 22:52
  • +1 I like this better than the --tables method for this problem. In general, I would end up using --tables, but the --where is a very nice option. – Richard Aug 16 '11 at 00:44
  • When you mysqldump a single table, --lock-all-tables is not used. Because the where clause involves tables other than the one being dumped, you must tell mysqldump --lock-all-tables. The --lock-all-tables option is active for dumping one or more databases, NOT FOR A SINGLE TABLE. I tried to perform the 2nd and 3rd mysqldumps but it complained about this. Once I manually issued --lock-all-tables, the error went away and the mysqldump was successful. Also, please notice the first mysqldump in my answer does not have --lock-all-tables. – RolandoMySQLDBA Sep 02 '11 at 17:44
  • @Rolando thanks for your help. This worked perfectly – David LeBauer Sep 02 '11 at 18:34
  • @Rolando sorry, I didn't notice that you had answered my comment / question before I deleted it. I was getting the same error. After re-reading the manual, I see --lock-tables only locks tables being dumped. I was confused because --lock-all-tables locks all tables across all databases, which is not necessary when only using a single database. – David LeBauer Sep 02 '11 at 18:48
  • This saved me a ton of scripting work, thank you! My client deleted a bunch of data, discovered that they did it months later, and I needed to restore it without destroying present data. – Justin Fortier Nov 27 '17 at 16:20
  • @JustinFortier - I am in precisely the same mess. Have you any idea how I can make progress to restore in my case? This time, simple --where conditions may not work, as I'm talking about several layers of deeply cascading FK related data... – Ifedi Okonkwo Dec 09 '17 at 09:35
  • @IfediOkonkwo I only had one integer-based FK relation on a secondary table that contained metadata. Basically post.id was related to meta.post_id. I accomplished this using 2 separate mysqldumps: in the first dump I just had the missing IDs from the post table; in the second dump I did a query on meta.post_id with the same ids, giving me the related metadata only. I then concatenated these dumps together (see the sketch after this comment thread). – Justin Fortier Jan 08 '18 at 18:43
  • Keep in mind that SELECT... WHERE IN is way slower than JOIN, so performance here is going to suck. – siliconrockstar Apr 09 '18 at 19:58
  • This is not transactionally safe and thus not a general solution for the problem of selectively dumping specific rows from multiple tables. The tables might change during the dump so you might end up with broken foreign key references. That said, it is still a great tool if the tables are unrelated, or if you are optimistic that there will probably be no related changes during the dump (You could check after import for broken references and repeat the process until everything is consistent). – phobic Jun 04 '18 at 15:07
  • @phobic Of course, it is not transactionally safe on a live system. If there is some replication involved, then you can take a slave, stop replication on that slave, run these crazy dumps, and start replication on that slave, it should be OK. Checking for broken references would still need to be done. – RolandoMySQLDBA Jun 04 '18 at 16:46
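A sketch of the two-dump-and-concatenate approach Justin Fortier describes above; the post/meta table names and the id list are purely illustrative:

mysqldump -u... -p... --where="id in (101,102,103)" mydb post > restore.sql
mysqldump -u... -p... --where="post_id in (101,102,103)" mydb meta >> restore.sql

And for the broken-reference check that phobic and RolandoMySQLDBA mention, an anti-join per foreign key after the import will surface any orphans; a sketch for the example schema:

SELECT t2.id
FROM table2 t2
LEFT JOIN table3 t3 ON t3.id = t2.table3_id
WHERE t3.id IS NULL;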
8

I would consider using an 'outfile' as part of your SELECT instead of mysqldump to solve this problem. You can produce whatever SELECT statement you want, then append "INTO OUTFILE '/path/to/outfile.csv' ..." at the end with the appropriate configuration for CSV-style output. Then you can use 'LOAD DATA INFILE ...' syntax to load the data into your new schema location.

For example, using your SQL:

select table1.id, table1.level, table2.name, table2.level 
       from table1 join table2 on table1.id = table2.table1_id 
       join table3 on table3.id = table2.table3_id
       where table3.name in ('fee', 'fi', 'fo', 'fum')
INTO OUTFILE '/tmp/fee-fi-fo-fum.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
; 

Keep in mind you'll need enough available storage space on the target disk partition.
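If you want a rough sense of the space required, information_schema can estimate per-table sizes (approximate for InnoDB; 'mydb' is the database name used elsewhere in this thread):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'mydb'
  AND table_name IN ('table1', 'table2', 'table3');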

randomx
  • I like this for the dataload. You'll still need to get the schema over to the new database, but that's easily achieved using some other tricks. – Richard Aug 16 '11 at 00:46
  • I like this as well because some people may not want the base tables, just the joined result as a single CSV imported. +1 !!! – RolandoMySQLDBA Aug 16 '11 at 01:28
  • @randy Thank you for your answer, but I do not think that this solves my problem because I am not interested in a csv dump of query results. What I need to be able to do is to dump the database subset so that it can be installed on another machine, and then the query itself can be reproducible (and modifiable with respect to the same dataset). The goal is a computational workflow that supports reproducible research. – David LeBauer Aug 16 '11 at 16:21
  • For future readers re David's comment: as Richard mentioned, you need to separately export schema of the tables involved. Those schemas can easily be loaded into a new database. Then, as randomx said, you use Load Data Infile to load that .csv into that new database. Now, the query can be executed. – ToolmakerSteve Apr 06 '19 at 13:33
  • I just realized that the limitation of this technique is that the query output is not in the same organization as the original tables. While I do still like this approach, to recreate the original table structure: run separate queries, one per table, to export the data needed for that table (see the sketch below). – ToolmakerSteve Apr 06 '19 at 13:42
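A sketch of that per-table variant for the example schema, mirroring the row selection from the accepted answer (shown for table3; repeat for table1 and table2 with the corresponding subqueries):

SELECT table3.*
       FROM table3
       WHERE table3.name IN ('fee', 'fi', 'fo', 'fum')
INTO OUTFILE '/tmp/table3.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
;

Then, on the target server, after creating the empty tables:

LOAD DATA INFILE '/tmp/table3.csv' INTO TABLE table3
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
;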
6

The mysqldump utility has a --tables option that lets you specify exactly which tables to dump.

I don't know of any easier (automated) way.
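For the query in the question, that would look something like this (whole tables, all rows):

mysqldump -u... -p... mydb --tables table1 table2 table3 > subset.sql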

Richard
  • thank you for your help, but I only want to export the selected rows of each table, not the whole tables. I could have a script that follows the dump with delete from table1 where id not in (.....); if that is the easiest way (a sketch follows these comments). As long as the script can be automated, it is not necessary that a specific tool exists. – David LeBauer Aug 15 '11 at 22:21
  • You deserve a +1 because --tables would be simpler and dropping the unneeded data would be just more horse work in the new server, especially if the tables involved are over 1GB each. Most people would feel a greater level of comfort doing it that way because it just makes sense in terms of the steps. My answer just takes a little planning and a little more risk. – RolandoMySQLDBA Aug 15 '11 at 22:56
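A sketch of the dump-then-trim variant from the first comment: dump and load all three tables whole, then delete the rows the query does not need, reusing the subqueries from the accepted answer (assumes the FK columns are NOT NULL):

DELETE FROM table1
WHERE id NOT IN (SELECT table1_id FROM table2
                 WHERE table3_id IN (SELECT id FROM table3
                                     WHERE name IN ('fee','fi','fo','fum')));

DELETE FROM table2
WHERE table3_id NOT IN (SELECT id FROM table3
                        WHERE name IN ('fee','fi','fo','fum'));

DELETE FROM table3 WHERE name NOT IN ('fee','fi','fo','fum');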
4

What was useful for me was something like:

mysqldump -u db_user -p db_name table_name --no-create-info \
  --lock-all-tables --where 'id in (SELECT tn.id FROM table_name AS tn
  JOIN related_table AS rt ON tn.related_table_id = rt.id
  WHERE rt.some_field = 1)' > data.sql

From http://krosinski.blogspot.com/2012/12/using-table-join-with-mysqldump.html

Ryan
2

This question is quite old and already has some good answers. One possible solution is to use mysqldump with the --where option.

However, in recent years, several new products have been released to help you subset your database. Without delving too much into the details, these products allow you to make the configuration process more declarative, so that newly connected tables and columns are handled automatically. Additionally, you can embed them into your CI/CD pipeline.

I might be biased (I am the CTO at Synthesized), but we have an awesome tool in this category called Synthesized TDK: https://docs.synthesized.io/tdk/latest/. There is a free community version that supports only Open Source databases, so it should work for you!

An example configuration for your case would be (config.yaml):

default_config:
  mode: "KEEP"

tables:
  - table_name_with_schema: "table3"
    filter: "name in ('fee', 'fi', 'fo', 'fum')"

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

To run the tool, simply execute the corresponding command:

java -jar tdk.jar <connection options> -c config.yaml

To learn more about subsetting using a filter, please refer to the documentation: Data Filtering

If you have any questions, please join our community: Slack Community

2

Have you tried the QUOTE() function in MySQL?

SELECT CONCAT("insert into table4(id,level,name,levelt2) VALUES(",   quote(table1.id),   ",",    quote(table1.level),   ",",    quote(table2.name),   ",",    quote(table2.level),    ");") as q
       from table1 join table2 on table1.id = table2.table1_id 
       join table3 on table3.id = table2.table3_id
       where table3.name in ('fee', 'fi', 'fo', 'fum'); 

Save the above as query.sql, then run:

cat query.sql | mysql -D mydb --skip-column-names --raw > table4.sql
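Note that table4 must exist before the generated INSERT statements will load; a minimal sketch, with column types assumed from the selected columns (the last answer below shows how to derive the real types via SHOW CREATE TABLE):

CREATE TABLE table4 (
  id INT,            -- assumed type; copy from SHOW CREATE TABLE table1
  level INT,         -- assumed
  name VARCHAR(255), -- assumed
  levelt2 INT        -- assumed
);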
velcrow
2

I wrote a small script for a similar problem; here it is: https://github.com/digitalist/mysql_slice

<?php
include('queryDumper.php');


$exampleQuery="select * from information_schema.columns c1 
left join information_schema.columns c2 on 1=1 limit 1";

//define credentials
$exampleMysqli = new mysqli($host, $user, $password, $database);
$exampleResult=$exampleMysqli->query($exampleQuery);

//if  mysqlnd (native driver installed), otherwise use wrapper
$exampleData=fetchAll($exampleResult);
$exampleMeta=$exampleResult->fetch_fields();

/*
 * field content removal options
 * column name => function name in queryDumper.php, namespace QueryDumperHelpers
 * 
 * */

$forbiddenFields=array(
'password'=>'replacePassword', //change password -> md5("password")
'login'=>'replaceLogin', //change login vasya@mail.ru -> vasya@example.com
'comment'=>'sanitizeComment' //lorem ipsum or 
);


//get tables dump
$dump=(\queryDumper\dump($exampleData, $exampleMeta, $forbiddenFields));



$dropDatabase=true; //default false
$dropTable=true; //default false

$dbAndTablesCreationDump=\QueryDumperDatabaseAndTables\dump($exampleMysqli,$exampleMeta, $dropDatabase, $dropTable);

$databases=$dbAndTablesCreationDump['databases'];
$tables=$dbAndTablesCreationDump['tables'];
$eol=";\n\n";
echo implode($eol, $databases)."\n";
echo implode($eol, $tables).";\n";
echo "\n";

//consider using array_unique($dump) before imploding
echo implode("\n\n", $dump);
echo "\n";
?>

E.g., if you have this query:

SELECT * FROM employees.employees e1 
LEFT JOIN employees.employees e2 ON 1=1 
LIMIT 1; 

you get this dump:

DROP DATABASE `employees`;

CREATE DATABASE `employees`;
CREATE TABLE `employees` ( /* creation code */ ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT IGNORE INTO `employees`.`employees` VALUES ("10001","1953-09-02","Georgi","Facello","M","1986-06-26");

INSERT IGNORE INTO `employees`.`employees` VALUES ("10001","1953-09-02","Georgi","Facello","M","1986-06-26");
digitalist
1

In MySQL:

SHOW CREATE TABLE table1; -- use these two create statements
SHOW CREATE TABLE table2; -- to design table4's create statement
CREATE TABLE table4( .... );
INSERT INTO table4(id,level,name,levelt2)
SELECT table1.id, table1.level, table2.name, table2.level 
   from table1 join table2 on table1.id = table2.table1_id 
   join table3 on table3.id = table2.table3_id
   where table3.name in ('fee', 'fi', 'fo', 'fum'); 

On the command line:

mysqldump mydb table4 |gzip > table4.sql.gz

On your destination server, set up ~/.my.cnf:

[client]
default-character-set=utf8

Import on the destination server:

zcat table4.sql.gz | mysql
velcrow
  • as with the answer above https://dba.stackexchange.com/a/4663/1580, this does not generate the separate tables required to reproduce the original database structure. Rather, it creates a joined table of results. – David LeBauer Mar 02 '22 at 18:24