I'm wondering if someone could help me optimize my tables/query to speed up a query. It is currently running ridiculously slow. I think a well-thought out index could help me. Any help would be really appreciated
Tables URLS and TAGS mentioned below are 2 and 20 million rows respectively (will probably end up having 10x). A query like the one below already takes 10 seconds to run.
An Example: http://whatrethebest.com/php+tutorials
Tables
CREATE TABLE IF NOT EXISTS `TAGS` (
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `my_unique_key` (`hash`,`tag`),
KEY `tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and
CREATE TABLE IF NOT EXISTS `URLS` (
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`md5`),
UNIQUE KEY `md5` (`md5`),
KEY `numsaves` (`numsaves`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
QUERY
SELECT urls.md5, urls.url, urls.title, urls.numsaves
FROM urls
JOIN tags ON urls.md5 = tags.hash
WHERE tags.tag
IN (
'php', 'tutorials'
)
GROUP BY urls.md5
HAVING COUNT( * ) =2
ORDER BY urls.numsaves DESC
LIMIT 20
EXPLAIN
I'm not sure what this shows
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tags range my_unique_key,tag tag 767 NULL 230946 Using where; Using index; Using temporary; Using filesort
1 SIMPLE urls eq_ref PRIMARY,md5 PRIMARY 767 jcooper_whatrethebest_urls.tags.hash 1
So I think the problem is:
certain tags like 'php have 34,000 entries, most of which only have under 5 saves. But in order to get the 20 most saved it is having to sort them all.Right?
I can't really create a 'numsaves' column in TAGS and index on that because that number will be changing up and down, and that wouldnt make sense. Is it possible to create a cross-table index between urls.numsaves and tags.tag? Or a third table to use in my query somehow? Would this solve my problem? I know almost nothing about indexing.
Any help would be really appreciated!
EDITS BELOW
RESPONSE TO YperCube:
Thank you, Thank you, your suggestions have sped up my queries by a factor of 10-20X . This is an immense improvement. I can't thank you enough.
I'm posting my current queries and tables with execution times in case you or anyone else has any more optimization suggestions. I am worried that as my table grows I may not be able to keep my search times under 3 seconds, which would be a killer.
New Query Example 1
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = u.id
AND t2.tag = 'language'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 0.2009 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t2 ref tag_id tag_id 767 const 53820 Using where; Using index; Using temporary; Using filesort
1 SIMPLE t1 ref tag_id tag_id 772 const,jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Neq Query Example 2 (seems to be slower)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN
( SELECT ui.id, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON t1.url_id = ui.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = ui.id
AND t2.tag = 'language'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.id = u.id
ORDER BY ulim.numsaves DESC ;
Showing rows 0 - 29 ( 2,794,577 total, Query took 0.4633 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 20 Using filesort
1 PRIMARY u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 ulim.id 1
2 DERIVED t2 ref tag_id tag_id 767 53820 Using where; Using index; Using temporary; Using filesort
2 DERIVED t1 ref tag_id tag_id 772 jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
2 DERIVED ui eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Using Query Example on a Single Tag (slower by a lot)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 3.7395 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 ref tag_id tag_id 767 const 200576 Using where; Using index; Using temporary; Using filesort
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t1.url_id 1
I'm not sure why this one is so much slower?
Do you have any ideas of a query to optimize for querying a single tag?
My Current Tables
CREATE TABLE `urls` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `id_numsaves_IX` (`id`,`numsaves`)
) ENGINE=InnoDB AUTO_INCREMENT=2958560 DEFAULT CHARSET=utf8
CREATE TABLE `tags` (
`url_id` int(11) DEFAULT NULL,
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `tag_id` (`tag`,`url_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Thank you again