The OpenNET Project / Index page

Leak of the Opera 12.15 browser source code, opennews (??), 14-Jan-17

300. "Leak of the Opera 12.15 browser source code"
Posted by a.non (?), 28-Jan-17, 16:58 (in reply to #228)
> another minipatch, for content blocker UI: don't add «http://» for patterns like
> «||example.com/*», and replace simple «example.com»
> to «||example.com/*».
> http://dpaste.com/21AF0F9

Better to make it "||example.com^*" to match "example.com:80" too.

Please note that it will also match "http://foo.com/example.com/", even though it is not supposed to.


301. "Leak of the Opera 12.15 browser source code"
Posted by arisu (ok), 28-Jan-17, 17:12
> Please note, that it will match "http://foo.com/example.com/" even though it is not
> supposed to.

yeah, you are right. thanks, this is something i will fix eventually. i didn't read the code deeply enough. my bad.


302. "Leak of the Opera 12.15 browser source code"
Posted by Anonymous (-), 28-Jan-17, 18:36
"*^" is bugged too. "*/foo" and "http://example.com^foo" match "http://example.com/foo", but "*^foo" does not.

303. "Leak of the Opera 12.15 browser source code"
Posted by arisu (ok), 28-Jan-17, 18:51
> "*^" is bugged too. "*/foo" and "http://example.com^foo" match "http://example.com/foo",
> but "*^foo" does not.

yeah. they did matching optimization by searching for the first character from the pattern (after the «*»), and failed to consider that «^» has a special meaning there.

actually, the whole wildcard matching code is crappy (although not a cosmic horror ;-), and should be rewritten.
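
To make the failure mode described above concrete, here is a minimal Python sketch. It is not Opera's code: the function names, the `literal_skip` flag, and the ABP-style reading of «^» (any character that is not a letter, a digit, or one of `_ - . %`) are assumptions for illustration, and the "«^» also matches end of URL" case is left out for brevity. With `literal_skip=True` the matcher looks for the literal character that follows «*», which is the kind of shortcut that breaks «*^foo».

```python
def is_separator(ch):
    # Assumed ABP-style separator: not a letter, digit, or one of _ - . %
    return not (ch.isalnum() or ch in "_-.%")


def char_matches(p, ch):
    # '^' is a character class, every other pattern char is a literal
    return is_separator(ch) if p == "^" else p == ch


def match(pattern, url, literal_skip=False):
    """Match pattern against the whole url; '*' is a wildcard, '^' a separator."""
    if not pattern:
        return not url
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        if not rest:
            return True
        if rest[0] == "*":
            return match(rest, url, literal_skip)
        if literal_skip:
            # Flawed shortcut: look for the next pattern char as a literal,
            # so a following '^' never finds a starting position.
            starts = [i for i, ch in enumerate(url) if ch == rest[0]]
        else:
            # Same shortcut, but aware that '^' stands for a class of chars.
            starts = [i for i, ch in enumerate(url) if char_matches(rest[0], ch)]
        return any(match(rest, url[i:], literal_skip) for i in starts)
    return bool(url) and char_matches(head, url[0]) and match(rest, url[1:], literal_skip)
```

The only difference between the two code paths is how the candidate start positions after «*» are chosen, so the shortcut can be kept as long as it goes through the same character test as the rest of the matcher.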


304. "Leak of the Opera 12.15 browser source code"
Posted by Anonymous (-), 29-Jan-17, 00:19
|| and ^ were added late, so they might cause bugs anywhere.

I've checked the code, and it looks like the pipe can break the hashing code, because IsCharacterWildCard() does not recognize it. For example, "||com/" does not match "http://example.com/".

The content_filter could be improved a lot in many interesting ways.

All the extensions with ad-block subscriptions take a very long time to load and update filters. I haven't tried to debug them, but I suspect that it is the API being very slow to load filters one-by-one and not the extensions themselves.

I also suspect that the matching algorithm can be improved, especially for huge lists. /content_filter/tests/data/hugelist.ini has 4k lines... fanboy-ultimate.txt is over 90k (~50k without CSS rules). Currently they seem to check every filter's hash for the given URL in FilterURLList::Find(). One beautiful optimization would be to store most rules starting with || in a hash table with the domain as the key (with a 2-level domain minimum and, say, a 4-level maximum). That should make matching over that set faster, especially since you often have to check multiple URLs with the same domain. More general rules could also use something of O(url_length * log(filter_list_length)) complexity, although I'm not sure what's possible there and how it would compare to the current approach.

The API and format could be improved too. Caching hashes between browser restarts and differential list updates come to mind.
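
The "domain as a key" idea could be sketched roughly as follows. This is an illustration in Python, not the content_filter code: `DomainIndex`, `add` and `candidates` are hypothetical names, and the rule for what counts as an indexable host (a «||» prefix followed by a plain hostname ending in `/`, `^` or `:`) is an assumption. The index only narrows down the candidate rules; each candidate would still need a full pattern check afterwards.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# "||host" followed by a character that clearly ends the hostname
_HOST_RULE = re.compile(r"^\|\|([a-z0-9-]+(?:\.[a-z0-9-]+)*)[/^:]")


class DomainIndex:
    def __init__(self, min_labels=2, max_labels=4):
        self.min_labels = min_labels
        self.max_labels = max_labels
        self.by_domain = defaultdict(list)  # domain suffix -> rules
        self.generic = []                   # rules that cannot be keyed

    def add(self, rule):
        m = _HOST_RULE.match(rule.lower())
        labels = m.group(1).split(".") if m else []
        if len(labels) < self.min_labels:
            self.generic.append(rule)
            return
        # Key on at most max_labels trailing labels of the rule's host
        self.by_domain[".".join(labels[-self.max_labels:])].append(rule)

    def candidates(self, url):
        """Rules that might match url: keyed hits first, then generic rules."""
        host = (urlsplit(url).hostname or "").lower()
        labels = host.split(".")
        found = []
        for n in range(self.min_labels, min(self.max_labels, len(labels)) + 1):
            found.extend(self.by_domain.get(".".join(labels[-n:]), ()))
        return found + self.generic
```

A lookup costs a few dictionary probes per URL (one per suffix length), regardless of how many «||» rules are loaded. Rules such as "||com/" fall below the 2-level minimum and stay in the generic list.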


305. "Leak of the Opera 12.15 browser source code"
Posted by arisu (ok), 29-Jan-17, 00:41
yeah, you are absolutely right. the extension API for content blocker is freakin' slow, and the matching itself can be improved a lot. it is also possible to build some «preselect» tables to quickly reject most filters, and painlessly process megabytes of data from urlfilter.ini. ;-) something like boyer-moore, but for filters. i believe that adblock+ and uBO are both doing something like that; and we have a somewhat easier task at hand, 'cause we don't need to analyze full regexps.

309. "Leak of the Opera 12.15 browser source code"
Posted by Anonymous (-), 29-Jan-17, 16:59
> something like boyer-moore, but for filters.

Well, before implementing a custom substring search algorithm for URLFilter::MatchUrlPattern(), I would try converting the pattern to a RegExp. It should be faster than the current naive checker and as fast as any other custom algorithm. RegExps could be compiled on demand and cached.

For the preselection, something that exploits common filtering patterns could be used. The "domain as a dictionary lookup key" should take care of about 3/4 of the rules.

Additionally, some lookup by path parts as keys could be devised. Google does something similar: https://developers.google.com/safe-browsing/v4/urls-hashing#... .

Both of these lookups could be enabled based on the number of qualifying patterns.

For the rest, the current FilterURLList::Find() can be used. I don't think it differs significantly from the "shortcuts comparison" described here: https://adblockplus.org/en/faq_internal#filters . But either algorithm could be further improved, for example, by sorting those hashes/shortcuts.
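
As a rough illustration of the "convert the pattern to a RegExp, compile on demand, cache" suggestion, here is a sketch using Python's `re` and `functools.lru_cache` rather than Opera's modules/regexp. The function names are hypothetical, and the translation assumes ABP-style semantics: «*» is a wildcard, «^» is a separator (or end of URL), a leading «||» anchors the pattern to the start of the host or of any subdomain, and the pattern has to cover the whole URL.

```python
import re
from functools import lru_cache

# Assumed '^' semantics: a char that is not a letter, digit, or _ - . %,
# or the end of the URL.
SEPARATOR = r"(?:[^\w.%-]|\Z)"
# Assumed '||' semantics: scheme, then optionally any number of subdomains.
HOST_ANCHOR = r"[a-z][a-z0-9+.-]*://(?:[^/?#]+\.)?"


def pattern_to_regex(pattern):
    """Translate a filter pattern into regular expression source text."""
    parts = []
    if pattern.startswith("||"):
        parts.append(HOST_ANCHOR)
        pattern = pattern[2:]
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "^":
            parts.append(SEPARATOR)
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


@lru_cache(maxsize=4096)
def compiled(pattern):
    # Compiled lazily on first use, then served from the cache
    return re.compile(pattern_to_regex(pattern), re.IGNORECASE | re.DOTALL)


def matches(pattern, url):
    return compiled(pattern).fullmatch(url) is not None
```

Under these assumptions the cases mentioned earlier in the thread behave as expected: "||example.com^*" matches "http://example.com:80/" but not "http://foo.com/example.com/", and "*^foo" matches "http://example.com/foo". Whether this is actually faster than a hand-written matcher depends on the regexp engine and on how often the cache is hit.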


310. "Leak of the Opera 12.15 browser source code"
Posted by arisu (ok), 29-Jan-17, 17:20
i'd rather avoid using regexps in content blocker: they aren't *that* good at early rejection, and they are slower and more memory-demanding. i think it is better to calculate b-m-like «substring presence» tables for patterns, and a b-m table for the url before checking. the b-m table for urls can be cached in the content blocking module, btw, but i don't think that it is really worth the effort: «domain as key» will take care of most rejects, i believe.
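
One possible reading of the «substring presence» idea, sketched in Python: precompute for every pattern the set of two-character windows (bigrams) that its literal parts contain, compute the same set once per URL, and reject the pattern early if any of its bigrams is missing from the URL. This is not Boyer-Moore proper and not taken from any existing blocker; the function names are hypothetical, case-insensitive matching is assumed, and a real implementation would probably pack the sets into fixed-size bit tables.

```python
import re


def text_bigrams(text):
    """All adjacent character pairs of text, lowercased."""
    text = text.lower()
    return frozenset(a + b for a, b in zip(text, text[1:]))


def pattern_bigrams(pattern):
    """Bigrams the URL must contain for the pattern to have any chance."""
    required = set()
    # '*', '^' and '|' are not literals, so they split the pattern into runs
    for run in re.split(r"[*^|]", pattern):
        required |= text_bigrams(run)
    return frozenset(required)


def may_match(required, url_bigrams):
    # False means "certainly no match"; True means "run the full matcher"
    return required <= url_bigrams
```

The table for the URL is built once and reused for every pattern, so the per-pattern cost is a single subset test. The check can produce false positives (all bigrams present, but not in the right order), never false negatives.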

311. "Leak of the Opera 12.15 browser source code"
Posted by Anonymous (-), 29-Jan-17, 17:53
It's worth a shot. To be clear, I mean that the pattern should be converted into a regexp inside URLFilter::MatchUrlPattern(), after all the preselection steps. I have no idea how much overhead regexp creation would introduce. And its speed would depend hugely on the regexp engine. But don't forget that Boyer-Moore doesn't come for free either.

312. "Leak of the Opera 12.15 browser source code"
Posted by arisu (ok), 29-Jan-17, 18:11
> It's worth a shot.

yeah. it is never too late to revert all the changes anyway. ;-) i'm not trying to stop you from implementing that, just thinking out loud.

> To be clear, I mean that the pattern
> should be converted into regexp inside the URLFilter::MatchUrlPattern(), after all the
> preselection steps. I have no idea how much overhead would regexp
> creation introduce.

quite a lot. any decent regexp engine compiles the expression into internal bytecode of some kind (possibly building kickstart and quick reject tables too), so it is better to at least cache the created regexps.

> And its speed would depend hugely on the regexp engine.

here i think we should use the one that is used by Opera Core itself (modules/regexp). it is even JITed, as far as i can see.

> But don't forget that Boyer-Moore doesn't come for free either.

yeah, but it is fast enough for small strings. and we can switch to the b-m codepath only if we have some «big enough» number of matches to check (after all possible early rejections).


307. "Leak of the Opera 12.15 browser source code"
Posted by VladSh (?), 29-Jan-17, 15:11 (in reply to #304)
This discussion is going on there: http://forum.timsky.ru/viewtopic.php?f=7&t=35&sid=709e173737...
