Content filtering in Otter

2014-05-26, 16:16:45

Opera allowed you to block addresses by specifying a list of patterns. The annoying thing was that it wasn't possible to specify exceptions to the patterns, so some rules simply could not be expressed. Also, each pattern was matched against the complete address string. You could not specify, for instance, that only the server name should be checked against the pattern. Example: you want to block everything from the server "foo.com" but not block "yourproxy.com/foo.com". You couldn't specify this in Opera.

I hope that these limitations won't be in Otter. As a first easy solution, it should be possible to specify both inclusion and exclusion rules, which are applied in the order of appearance, just like filters are specified in httrack (http://www.httrack.com/). Thus, if "-" means "exclude" and "+" means "include", it should be possible to specify something like:

+"*" -- include everything
-"*foo.com*" -- but exclude "foo.com"
+"*yourproxy.com/*" -- but include, for instance, "yourproxy.com/foo.com"

The exact syntax is debatable, but in the filter example above you might have guessed that "*" stands for an arbitrary-length wildcard and "--" for comments. However, the simple fact that the rules are checked in order of appearance means that the filter would already be much more powerful than Opera's.
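
The ordered include/exclude idea can be sketched in a few lines of Python. This is only an illustration, not Otter code: the names `parse_rules` and `is_allowed` are made up, `fnmatchcase` stands in for whatever wildcard matcher a browser would really use, and URLs that match no rule are assumed to be allowed.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical syntax from the post: +"pattern" or -"pattern"; "--" starts a comment.
RULE_RE = re.compile(r'^\s*([+-])"([^"]*)"')

def parse_rules(text):
    """Return a list of (include, pattern) tuples in order of appearance."""
    rules = []
    for line in text.splitlines():
        match = RULE_RE.match(line)
        if match:
            rules.append((match.group(1) == "+", match.group(2)))
    return rules

def is_allowed(url, rules, default=True):
    """Apply the rules in order; the last matching rule decides."""
    allowed = default  # assumption: unmatched URLs are allowed
    for include, pattern in rules:
        if fnmatchcase(url, pattern):
            allowed = include
    return allowed

RULES = parse_rules('''
+"*" -- include everything
-"*foo.com*" -- but exclude "foo.com"
+"*yourproxy.com/*" -- but include, for instance, "yourproxy.com/foo.com"
''')

for url in ("http://foo.com/ad.png", "http://yourproxy.com/foo.com"):
    print(url, "->", "allowed" if is_allowed(url, RULES) else "blocked")
```

Note that `fnmatchcase` also gives "?" and "[...]" a special meaning, which a real URL filter would probably not want.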

Implementing some way to specify rules that only apply to certain structural elements of the URL string, like domain or host names, could be delayed to the future.

Re: Content filtering in Otter

Reply #1 – 2014-05-26, 16:38:54

@Somebody, so far it seems that the best approach would be to be compatible with AdBlock, which would let users reuse lots of already existing rules.
I guess that should be good enough; we could extend it if needed, preferably in cooperation with them, so it would stay compatible in the future.
It could be possible to have more than one interpreter (for example, to support rules from classic Opera) or to allow importing other formats.

Here is the ticket in the issue tracker:
https://github.com/Emdek/otter/issues/206

It hasn't been decided yet how these rules should be stored, but we will probably stick to plain text files, not SQLite.

Re: Content filtering in Otter

Reply #2 – 2014-05-26, 19:50:14

I don't know the AdBlock format but if several interpreters are possible, I'd like to see at least one simple thing like Opera's urlfilter.ini. I opened that file from Opera's profile folder and it looks like this:

; some commentary
[prefs]
prioritize excludelist=1
[include]
*
[exclude]
pattern-1
pattern-2
..
pattern-n

The patterns are optionally followed by "=UUID:crypticsequence", which is appended to each pattern after you edit the rules from within Opera itself and save them. I don't know what the UUID is used for. If it is omitted, the filter still works. If you replace "prioritize excludelist=1" with "prioritize excludelist=0", every URL that matches any pattern from the "[include]" block is allowed, regardless of whether it also matches a pattern from the "[exclude]" block, and at the same time, URLs that match no pattern in the "[include]" block count as filtered out. This is rather useless, since all you can do with this scheme is either allow everything or allow only certain patterns (whitelisting, since everything else is blocked). In Opera itself, you cannot change the "[include]" patterns at all, and the predefined pattern is "*". You can only add patterns that you want to exclude.
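
To make the behaviour described here concrete, the following Python sketch reads a file of this shape and applies the two "prioritize excludelist" modes as this post describes them. It is not Opera's actual code: the names `parse_urlfilter` and `is_allowed` are invented, the sample host and UUID are placeholders, and blocking a URL that matches neither list is an assumption.

```python
from fnmatch import fnmatchcase

def parse_urlfilter(text):
    """Parse urlfilter.ini-like text into prefs plus include/exclude pattern lists."""
    data = {"prefs": {}, "include": [], "exclude": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].lower()
            continue
        if section == "prefs":
            key, _, value = line.partition("=")
            data["prefs"][key.strip()] = value.strip()
        elif section in ("include", "exclude"):
            # Drop the optional "=UUID:..." suffix that Opera appends.
            data[section].append(line.split("=UUID:", 1)[0])
    return data

def is_allowed(url, data):
    included = any(fnmatchcase(url, p) for p in data["include"])
    excluded = any(fnmatchcase(url, p) for p in data["exclude"])
    if data["prefs"].get("prioritize excludelist", "1") == "1":
        return included and not excluded  # exclusions win
    return included  # inclusions win; everything else is filtered

SAMPLE = """; some commentary
[prefs]
prioritize excludelist=1
[include]
*
[exclude]
*ads.example.com*=UUID:0123
"""

data = parse_urlfilter(SAMPLE)
print(is_allowed("http://ads.example.com/banner.gif", data))
print(is_allowed("http://example.org/", data))
```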

I suggest that the useless "[prefs]" block be omitted, that the "[include]" and "[exclude]" blocks be evaluated in order of appearance, and that a sequence of "[include]" and "[exclude]" blocks be allowed. Opera itself, for instance, does not care if an additional "[include]" block follows the "[exclude]" block, leading to the problem that you cannot specify exceptions to your "[exclude]" patterns. In the proposed scheme, an entered URL would be checked against the patterns in the blocks in their order of appearance and treated accordingly. If a URL matches an "[exclude]" pattern, the URL is filtered, and the engine can jump to the next "[include]" block to check whether an explicit inclusion rule overrides the previous exclusion rule, and so on. That way, it would be possible to exclude "someurl.net" but include "proxy.net/someurl.net" by specifying this properly. Note that the example in my previous post was not very well chosen, since, translated to the scheme suggested here, it would read:

[include] ; first rule block
*
[exclude] ; second rule block
*foo.com*
[include] ; third rule block (may include URLs previously excluded)
*yourproxy.com*

The last block would actually also allow, for instance, "http://foo.com/somefoo/yourproxy.com", which is an unintended edge case, although "foo.com" would still be blocked in most cases (those that don't contain "yourproxy.com"; you get the idea). You'd need to be careful with your rule specifications to cover all edge cases. That is why I said that it would ultimately also be desirable to specify filters for structural URL parts, where the filter would only target, for instance, the domain name, not the whole text string. This would be a convenience that avoids having to craft regular expressions that are very hard to decipher. I don't think that a simple filter engine should actually implement full regular expression parsing, since that seems to be overkill for most cases. If needed at all, such an engine could still be implemented later.
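
A rough Python sketch of the proposed block evaluation, including the edge case just described. Everything here is illustrative: `parse_blocks`, `is_allowed` and the `host_only` switch are hypothetical names, and matching only the host name is just one possible reading of "filters for structural URL parts". Splitting comments at ";" is also a simplification, since real URLs may contain that character.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

def parse_blocks(text):
    """Return [(kind, [patterns])] in order of appearance; ';' starts a comment."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        if line in ("[include]", "[exclude]"):
            blocks.append((line[1:-1], []))
        elif blocks:
            blocks[-1][1].append(line)
    return blocks

def is_allowed(url, blocks, host_only=False):
    """Later blocks override earlier ones; unmatched URLs are allowed."""
    subject = (urlsplit(url).hostname or "") if host_only else url
    allowed = True
    for kind, patterns in blocks:
        if any(fnmatchcase(subject, p) for p in patterns):
            allowed = kind == "include"
    return allowed

BLOCKS = parse_blocks("""
[include] ; first rule block
*
[exclude] ; second rule block
*foo.com*
[include] ; third rule block
*yourproxy.com*
""")

edge_case = "http://foo.com/somefoo/yourproxy.com"
print(is_allowed(edge_case, BLOCKS))                  # whole-string matching
print(is_allowed(edge_case, BLOCKS, host_only=True))  # host name only
```

With whole-string matching the edge case slips through the third block; when only the host name is tested, the same rules block it while "http://yourproxy.com/foo.com" stays allowed.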

As an addition, further block types could be introduced, like "[warn]", which could mean that the user shall be warned if some URL matches any pattern (and show which one) specified therein. This could be nice for debugging, if a user is uncertain whether some of his rules lead to collateral damage like blocking URLs that he didn't intend to block. Or possibly for some other purposes that you might come up with. These extra block types would, however, be merely "nice to have" and could be implemented later.

In the suggested scheme, whitelisting would be implemented as follows:

[exclude] ; everything is excluded,
*
[include] ; except the URLs that you explicitly specify
pattern-1
pattern-2
..
pattern-n

Re: Content filtering in Otter

Reply #3 – 2014-05-26, 20:35:22

@Somebody, it's technically possible to have multiple interpreters, but the best approach would be to only allow importing existing filters and to keep a single list in memory. It would work faster and would be easier to maintain.

Here is the documentation for AdBlock rules:
https://adblockplus.org/en/filters
They do support exception rules. :-)

Instead of rules raising warnings, I would go for reporting each applied rule through the Error Console (maybe adding a separate category).
I would also like to have an action that disables the content blocker for a specific tab (I've too often seen an error page informing me that the requested URL was blocked and that I should remove the rule to visit it...) and maybe for specific sites (using site-specific preferences), to allow showing embedded content that should be blocked on other sites.

Re: Content filtering in Otter

Reply #4 – 2014-05-26, 21:33:42

I guess it could be made selectable which interpreter (and thus which rule syntax) to use. That way, only one list would have to be kept in memory. I agree that it is not necessary to have multiple engines active at the same time right from the beginning. A modular approach to engine integration would be useful, so that people could plug in their own interpreters if they want to experiment.

The warning-raising rule block was just a quick idea. Having some notice about blocked content on the site currently loaded would be useful (maybe show it for a few seconds in some corner). Still, it would be nice if certain blocked content could be made "more visible" than other content, so there should be some way to mark the rules that you want to be more visible. URLs of blocked content marked in this way could perhaps be printed in bold or some such when shown in an error console.

I agree that it would often be useful to have tabs with a modified or disabled content blocker. I have certain sites that I usually block, but which I occasionally want to visit nonetheless. For that, I suggest named filter sets, i.e., a way to give names to your filter rule sets. Which rule set to apply to a given tab could then be selected from a popup menu. If you wanted to maintain only one list in memory, however, you would have to think about extension rule sets that would be concatenated to the list in memory and used for the tabs that select them. I think the approach I presented in my previous post would already suit that scheme, since pattern blocks would be checked in their order of appearance. Say you normally want to block "example.com" and therefore have this rule in your standard rule set:

[exclude]
*example.com*

Let's say, you now want to allow "example.com" in some tab but keep all your other rules intact. You could have this in an extension rule set:

[include]
*example.com*

As you can see, if you concatenate this to your standard rule set and evaluate the rules in their order of appearance, "example.com" first gets excluded, only to be finally included again. Not very efficient, but it only affects the current tab where you want to have that exception. The alternative to extension rule sets would be to have completely independent rule sets. In the end, there are probably use cases for both. So ideally, the popup menu for your tab should show both the independent rule sets and the extension rule sets, and you should be able to load exactly one independent rule set and an arbitrary number of extension rule sets per tab to have full flexibility. To distinguish the sets, their names could be shown in different colours or with some prefix prepended. So, in the popup menu, you could find, for instance (where "--" marks a comment):

Standard -- your standard rule set
Fancy -- some alternative rule set, that you have defined
[E] allow foo.com -- some extension rule set, where you explicitly allow "foo.com"; "[E]" denotes "extension rule set"
[E] allow blablub -- some other extension rule set, where you allow or deny some other patterns

The standard rule set would be selected by default for each tab. If you unselect the "Standard" entry in the popup menu, content blocking would be disabled entirely for the affected tab.
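
The per-tab selection could be modelled roughly like this in Python. All names (`RULE_SETS`, `rules_for_tab`, `is_allowed`) are made up for illustration, and representing a rule set as a list of (kind, pattern) pairs evaluated in order is an assumption, not an existing Otter structure.

```python
from fnmatch import fnmatchcase

# Hypothetical named rule sets: independent ones and "[E]" extension sets.
RULE_SETS = {
    "Standard": [("include", "*"), ("exclude", "*example.com*")],
    "Fancy": [("include", "*")],
    "[E] allow example.com": [("include", "*example.com*")],
}

def rules_for_tab(independent, extensions=()):
    """One independent set (or None) followed by any number of extension sets."""
    if independent is None:
        return []  # nothing selected: content blocking is disabled for this tab
    rules = list(RULE_SETS[independent])
    for name in extensions:
        rules.extend(RULE_SETS[name])  # concatenated, so they are checked last
    return rules

def is_allowed(url, rules):
    """Rules are checked in order of appearance; the last match decides."""
    allowed = True
    for kind, pattern in rules:
        if fnmatchcase(url, pattern):
            allowed = kind == "include"
    return allowed

url = "http://example.com/"
print(is_allowed(url, rules_for_tab("Standard")))
print(is_allowed(url, rules_for_tab("Standard", ["[E] allow example.com"])))
print(is_allowed(url, rules_for_tab(None)))
```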

Re: Content filtering in Otter

Reply #5 – 2014-05-27, 09:24:09

@Somebody, most likely there won't be built-in support for changing the interpreter; it's not something that would be useful enough to make it pluggable by default. But if someone would like to experiment, it should be easy to replace it and simply recompile.
It's doable to import rules from Opera into the format used by AdBlock; probably the same applies to other formats (at least partially).
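
A conversion along these lines might look as follows in Python. It is a simplified sketch, not a planned implementation: the function name `opera_to_adblock` is invented, it relies on the AdBlock Plus conventions that "@@" marks an exception rule and "|" anchors a filter to the start or end of the address, and it ignores patterns that would collide with other AdBlock syntax (for example ones starting with "!" or wrapped in slashes).

```python
def opera_to_adblock(pattern, exception=False):
    """Convert one Opera wildcard pattern to an AdBlock-style filter line.

    Returns None for a bare "*", which is left for the caller to handle.
    """
    pattern = pattern.split("=UUID:", 1)[0].strip()
    body = pattern.strip("*")
    if not body:
        return None
    # Opera matches the whole address; AdBlock matches substrings unless anchored.
    start = "" if pattern.startswith("*") else "|"
    end = "" if pattern.endswith("*") else "|"
    return ("@@" if exception else "") + start + body + end

print(opera_to_adblock("*foo.com*"))
print(opera_to_adblock("*yourproxy.com/*", exception=True))
print(opera_to_adblock("http://foo.com/*"))
```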

Named filter sets could be done (it's part of the AdBlock infrastructure), but such fine-tuning would probably make more sense for site-specific preferences (which sets should be applied); for quick toggling per tab it should be enough to simply disable them all. A context menu to select the applied rules per tab could be reconsidered later; it's not that hard to do. ;-)

It's important not to overcomplicate it; we could use the same widget to select the list of active sets and their priorities, globally and per site.

This is how the Error Console looks right now (docked by default; it could be detached to behave more like the one from classic Opera):
http://im9.eu/picture/wb2016
I would like to add an option to let it show results only from the active tab; then it would be easier to check which content blocking rules got applied to the displayed page.

Re: Content filtering in Otter

Reply #6 – 2014-05-27, 20:48:00

OK, I've now looked into the AdBlock filter documentation. The rules presented there seem to make writing a filter rule file easy enough. Will Otter implement its own interpreter for these rules? How and when will URLs be checked against the rules in a filter file? I wonder how element hiding will work. Does Otter fetch the HTML file from the server, do some work on it, and then feed the processed character stream to the rendering engine, or does WebKit do more work on its own?

Since such a thing as element hiding already exists in the AdBlock rules, I wonder whether this could be extended to some degree. Looking at HTTrack's filtering options (http://www.httrack.com/html/filters.html), I can see how, for instance, file-size based filtering or hiding could be added somehow to the AdBlock rules. However, AdBlock and HTTrack have different purposes. Filters based on file size might not be that relevant for an ad blocker.

However, what currently serves only content filtering could be of more general utility if each URL has to be checked anyway. The rule matching could also be used to trigger special handling of certain URLs. For instance, users could specify that certain addresses must only be requested via HTTPS instead of HTTP. I know that I have to be careful with my web mailer, lest I log in over the unencrypted interface. Since both the encrypted and the unencrypted login pages are provided under the same address, such a rule would free me from having to type the protocol prefix manually; without the prefix, the browser defaults to HTTP.
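
As a minimal illustration of such a rule-triggered rewrite, here is a Python sketch. The host patterns and the names `FORCE_HTTPS_HOSTS` and `apply_https_rule` are hypothetical, and matching only on the host name is an assumption; the port, if any, is left untouched.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rule list: host patterns that must only be reached via HTTPS.
FORCE_HTTPS_HOSTS = ["mail.example.com", "*.mail.example.com"]

def apply_https_rule(url, host_patterns=FORCE_HTTPS_HOSTS):
    """Rewrite http:// to https:// when the host matches one of the patterns."""
    if "://" not in url:
        url = "http://" + url  # mimic the default protocol when none is typed
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme == "http" and any(fnmatchcase(host, p) for p in host_patterns):
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

print(apply_https_rule("mail.example.com/login"))
print(apply_https_rule("http://example.org/"))
```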

Going further down this path, it should also be possible to specify the secure protocol and ciphers to be used for individual sites. Currently, this cannot be done on a per-site basis in Opera. As a result, some pages don't load if you disable certain ciphers or protocol versions. At the same time, some sites select ciphers in the order in which they appear in Opera's list, although the servers would actually support more recent ciphers if the browser reported the older ones as unsupported.

So I wonder where the best place would be to configure such things in the browser. It could be in the site specific settings, but having some way to specify rule based triggers seems to be attractive nonetheless.

Re: Content filtering in Otter

Reply #7 – 2014-05-28, 06:26:13

@Somebody, URLs will be checked in the NetworkManager instance; CSS-based hiding rules will be applied using a user stylesheet.
All requests from QtWebKit go through that instance; it's possible to modify the returned contents and change the parameters of each request (for example, modify headers).
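
To illustrate the user stylesheet idea, here is a small Python sketch that turns AdBlock-style element hiding rules ("##selector", optionally prefixed with a comma-separated domain list) into CSS for one host. It is not the planned implementation: the function name `user_stylesheet` and the sample filters are invented, and rules with negated ("~") domains are simply skipped.

```python
def user_stylesheet(filter_lines, host):
    """Build CSS that hides elements on `host` from '##' element hiding rules."""
    selectors = []
    for line in filter_lines:
        line = line.strip()
        if line.startswith("!") or "##" not in line:
            continue  # comment, or not an element hiding rule
        domains, selector = line.split("##", 1)
        wanted = [d for d in domains.split(",") if d]
        if any(d.startswith("~") for d in wanted):
            continue  # negated domains are not handled in this sketch
        if not wanted or any(host == d or host.endswith("." + d) for d in wanted):
            selectors.append(selector)
    if not selectors:
        return ""
    return ",\n".join(selectors) + " { display: none !important; }\n"

FILTERS = [
    "! hypothetical sample filters",
    "##.banner",
    "example.com###sidebar-ad",
    "||ads.example.net^",
    "other.org##.promo",
]
print(user_stylesheet(FILTERS, "www.example.com"))
```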
Size-based rules could be explored later, but I would prefer to avoid extensions, so it won't break if upstream adds similar syntax with a different meaning. ;-)
Forcing HTTPS is doable; however, in some cases it might lead to an infinite loop if the site redirects back to HTTP.
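
One way to guard against such a loop is to remember which URLs have already been upgraded and to stop forcing HTTPS once a site bounces back. The Python sketch below only illustrates that idea; `HttpsUpgrader` and `max_upgrades` are hypothetical names, and a real browser would probably reset the bookkeeping per navigation rather than keep it forever.

```python
from urllib.parse import urlsplit, urlunsplit

def to_https(url):
    """Return the same URL with the scheme replaced by https."""
    return urlunsplit(urlsplit(url)._replace(scheme="https"))

class HttpsUpgrader:
    """Upgrade http:// requests, but give up on URLs that bounce back to HTTP."""

    def __init__(self, max_upgrades=1):
        self.max_upgrades = max_upgrades
        self.upgrades = {}  # https URL -> number of times it was forced

    def rewrite(self, url):
        if urlsplit(url).scheme != "http":
            return url
        secure = to_https(url)
        count = self.upgrades.get(secure, 0)
        if count >= self.max_upgrades:
            return url  # the site redirected back to HTTP; stop forcing HTTPS
        self.upgrades[secure] = count + 1
        return secure

upgrader = HttpsUpgrader()
first = upgrader.rewrite("http://example.com/")   # forced to HTTPS
second = upgrader.rewrite("http://example.com/")  # server redirected back to HTTP
print(first, second)
```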
Setting ciphers by URL should be doable too. I'm not sure about protocols; the API allows setting only one value from these:
http://qt-project.org/doc/qt-5/qssl.html#SslProtocol-enum