I have two databases with the same schema, and I want to efficiently diff one of the tables between them. That is, return only the records that are unique, ignoring the primary key.
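import sqlite3

# Assumed: db1 and db2 are connections to the two database files
db1 = sqlite3.connect('/path/to/db1.db')
db2 = sqlite3.connect('/path/to/db2.db')

# Column names of foo (field 1 of each PRAGMA table_info row)
columns = list(zip(*db1.execute("PRAGMA table_info(foo)").fetchall()))[1]

db1.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db1.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")
db2.execute("ATTACH DATABASE '/path/to/db1.db' AS db1")
db2.execute("ATTACH DATABASE '/path/to/db2.db' AS db2")

# Compare every non-key column (columns[0] is the primary key)
where = ' AND '.join('one.{0} != two.{0}'.format(c) for c in columns[1:])
# Return the db2 rows paired with a db1 row that differs in all those columns
data = db2.execute("""
    SELECT two.*
    FROM db1.foo AS one
    JOIN db2.foo AS two
    WHERE {}
""".format(where)).fetchall()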
That is, ignoring the primary key (in this case meow), do not return records that are identical in both databases.
Table foo in db1 is as follows:
meow  mix  please  deliver
1     123  abc
2     234  bcd     two
3     345  cde
And table foo in db2 looks like this:
meow  mix  please  deliver
1     345  cde
2     123  abc     one
3     234  bcd     two
4     456  def     four
Thus, the entries unique to db2 are:
[(2, 123, 'abc', 'one'), (4, 456, 'def', 'four')]
which is what I get. This works great when the table has more than two columns. But when there are only two, that is, the primary key and a single value, as in a lookup table:
db1          db2
bar  baz     bar  baz
1    123     1    234
2    234     2    345
3    345     3    123
             4    456
then the non-unique entries are repeated N-1 times and the unique entries N times, where N is the number of records in db1:
[(1, '234'),
(1, '234'),
(2, '345'),
(2, '345'),
(3, '123'),
(3, '123'),
(4, '456'),
(4, '456'),
(4, '456')]
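The repetition can be reproduced with a minimal sketch. It assumes two in-memory databases attached to a single connection (standing in for the files under /path/to/), the sample rows above with baz stored as text, and a query that selects the db2 side (two.*), which is what the output shown corresponds to. Counting duplicates with collections.Counter makes the N-1 versus N pattern visible.

```python
import sqlite3
from collections import Counter

# Assumption: in-memory stand-ins for the two database files.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS db1")
conn.execute("ATTACH DATABASE ':memory:' AS db2")

for name, rows in [
    ("db1", [(1, "123"), (2, "234"), (3, "345")]),
    ("db2", [(1, "234"), (2, "345"), (3, "123"), (4, "456")]),
]:
    conn.execute(f"CREATE TABLE {name}.foo (bar INTEGER PRIMARY KEY, baz TEXT)")
    conn.executemany(f"INSERT INTO {name}.foo VALUES (?, ?)", rows)

# Cross join: each db2 row is paired with every db1 row whose baz differs.
data = conn.execute("""
    SELECT two.*
    FROM db1.foo AS one
    JOIN db2.foo AS two
    WHERE one.baz != two.baz
""").fetchall()

N = conn.execute("SELECT COUNT(*) FROM db1.foo").fetchone()[0]
counts = Counter(data)
for row, n in sorted(counts.items()):
    # A row that matches no db1 row survives all N pairings.
    print(row, n, "unique to db2" if n == N else "also in db1")
```

A db2 row whose value also exists in db1 loses exactly one pairing (the matching one), hence N-1 copies; a row with no match keeps all N.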
To work around this, I filter the result in Python:
import itertools

N = db1.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
# Keep only the rows whose repeat count is a multiple of N
data = [
    list(row)
    for row, group in itertools.groupby(sorted(data))
    if len(list(group)) % N == 0
]
which gives:
[[4, '456']]
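As a quick sanity check, the filter can be run directly on the result list from the two-column example. This sketch hard-codes that list and N = 3 (the number of rows in db1's table in that example), and it swaps in collections.Counter for the sort-and-group step, so it is a variation on the filter rather than the exact code above.

```python
from collections import Counter

# Result list copied from the two-column example.
data = [(1, '234'), (1, '234'), (2, '345'), (2, '345'),
        (3, '123'), (3, '123'), (4, '456'), (4, '456'), (4, '456')]
N = 3  # assumed: number of rows in db1's table in that example

# Keep rows whose repeat count is a multiple of N.
unique = [list(row) for row, n in sorted(Counter(data).items()) if n % N == 0]
print(unique)  # [[4, '456']]
```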
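Ideally this would be done in SQL alone, without the Python post-processing.
Also, the query is slow on larger tables (my db has ~10k records). Is there a better way? Thanks!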