[cc-devel] Interesting problem

Discussion:

Maarten Zeinstra

2013-09-27 14:57:28 UTC

Hi list,

Creative Commons Netherlands hosts its own explanation of the CC licenses at http://creativecommons.nl/uitleg/. We have links to all CC licenses there, and we license our entire site under CC BY 3.0.

I now got a mail asking why people need to attribute Creative Commons Netherlands when they want to use CC BY 3.0. It turns out that the metadata scraper sees the website license and adds the extra metadata to the deed page.

How come this does not happen at https://creativecommons.org/licenses/, and how would I be able to avoid this?

Cheers,

Maarten

--
Kennisland | www.kennisland.nl | t +31205756720 | m +31643053919 | @mzeinstra

Mr. Puneet Kishor

2013-09-27 15:27:08 UTC

Post by Maarten Zeinstra
Hi list,
Creative Commons Netherlands hosts its own explanation of the CC licenses at http://creativecommons.nl/uitleg/. We have links to all CC licenses there, and we license our entire site under CC BY 3.0.
I now got a mail asking why people need to attribute Creative Commons Netherlands when they want to use CC BY 3.0. It turns out that the metadata scraper sees the website license and adds the extra metadata to the deed page.
How come this does not happen at https://creativecommons.org/licenses/, and how would I be able to avoid this?

This is almost impossible to answer without actually seeing the specific scraper that is making the mistake of combining the un-ported CC BY metadata with the CC BY NL metadata. That said, when I identify the RDFa on the page using W3C's N3 bookmarklet, I get

<http://creativecommons.nl/uitleg/>
<?> "article" ;
<?> "Uitleg bij de Creative Commons licenties" ;
<?> "http://creativecommons.nl/uitleg/" ;
<?> "Van "Alle rechten voorbehouden" naar "Sommige rechten voorbehouden" Creative Commons biedt auteurs, kunstenaars, wetenschappers, docenten en alle andere creatieve makers de vrijheid om op een flexi..." ;
<?> "Creative Commons Nederland" ;
<?> "http://creativecommons.nl/wp-content/uploads/2009/09/Schermafbeelding-2012-12-10-om-14.07.28.png" ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/nl/> ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/nl/> ;
<http://creativecommons.org/ns#attributionURL> <http://www.creativecommons.nl/> ;
<http://creativecommons.org/ns#attributionName> "Creative Commons Nederland" .

Doing the same on http://creativecommons.org/licenses/ gives

<http://creativecommons.org/licenses/>
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/> .

<http://creativecommons.org/>
<http://creativecommons.org/ns#attributionURL> <http://creativecommons.org/> ;
<http://creativecommons.org/ns#attributionName> "this site" ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/> .

As you can see, the first one has the URI to the NL version, while the latter points to the unported version.

--
Puneet Kishor
Science and Data at Creative Commons

Mike Linksvayer

2013-09-28 00:12:50 UTC

On Fri, Sep 27, 2013 at 8:27 AM, Mr. Puneet Kishor

Post by Mr. Puneet Kishor

Post by Maarten Zeinstra
Creative Commons Netherlands hosts its own explanation of the CC licenses at http://creativecommons.nl/uitleg/. We have links to all CC licenses there, and we license our entire site under CC BY 3.0.
I now got a mail asking why people need to attribute Creative Commons Netherlands when they want to use CC BY 3.0. It turns out that the metadata scraper sees the website license and adds the extra metadata to the deed page.
How come this does not happen at https://creativecommons.org/licenses/, and how would I be able to avoid this?

It doesn't happen on https://creativecommons.org/licenses/ because
that links to non-https deeds, so referer isn't sent, so scraper can't
do anything. (There are other problems, see below, which would make it
not work anyway.)

This also points to an easy immediate solution for you: link to https
versions of the deeds from the .nl site, which is only served over
http. Referer won't get sent.

The only general fix I can think of would be for the scraper to be
more conservative than it is -- look for bare (ie, not objects of
license statements) license urls, and if there's one that's the same
as a license url that is object of a license statement, don't add
anything to the deed, because there's no way of telling which one the
user clicked on.
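
The conservative check described above could be sketched roughly as follows. This is only an illustration and not the actual deed scraper: the names LicenseLinkCollector and is_ambiguous are made up for this example, and it assumes license statements are marked with a plain rel="license" attribute.

```python
from html.parser import HTMLParser


class LicenseLinkCollector(HTMLParser):
    """Collect link targets, split by whether they carry rel="license".

    Hypothetical helper for illustration; not part of the real scraper.
    """

    def __init__(self):
        super().__init__()
        self.stated = set()  # URLs that are objects of a license statement
        self.bare = set()    # URLs that are linked without a license statement

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated values, or be missing entirely
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels:
            self.stated.add(href)
        else:
            self.bare.add(href)


def is_ambiguous(html):
    """Return True when the same URL is both a license statement and a bare link.

    In that case the referring page gives no way of telling which link the
    user clicked, so a conservative scraper would add nothing to the deed.
    """
    collector = LicenseLinkCollector()
    collector.feed(html)
    return bool(collector.stated & collector.bare)
```

A page that licenses itself under a deed and also links to that same deed in its body text would be reported as ambiguous, which is the situation on the .nl explanation page.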

Post by Mr. Puneet Kishor
This is almost impossible to answer without actually seeing the specific scraper that is making the mistake of combining the un-ported CC BY metadata with the CC BY NL metadata. That said, when I identify the RDFa on the page using W3C's N3 bookmarklet, I get
<http://creativecommons.nl/uitleg/>
<?> "article" ;
<?> "Uitleg bij de Creative Commons licenties" ;
<?> "http://creativecommons.nl/uitleg/" ;
<?> "Van "Alle rechten voorbehouden" naar "Sommige rechten voorbehouden" Creative Commons biedt auteurs, kunstenaars, wetenschappers, docenten en alle andere creatieve makers de vrijheid om op een flexi..." ;
<?> "Creative Commons Nederland" ;
<?> "http://creativecommons.nl/wp-content/uploads/2009/09/Schermafbeelding-2012-12-10-om-14.07.28.png" ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/nl/> ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/nl/> ;
<http://creativecommons.org/ns#attributionURL> <http://www.creativecommons.nl/> ;
<http://creativecommons.org/ns#attributionName> "Creative Commons Nederland" .
Doing the same on http://creativecommons.org/licenses/ gives
<http://creativecommons.org/licenses/>
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/> .
<http://creativecommons.org/>
<http://creativecommons.org/ns#attributionURL> <http://creativecommons.org/> ;
<http://creativecommons.org/ns#attributionName> "this site" ;
<http://www.w3.org/1999/xhtml#license> <http://creativecommons.org/licenses/by/3.0/> .
As you can see, the first one has the URI to the NL version, while the latter points to the unported version.

This is quite broken. I see from archive.org it has been in place
since September of 2012. Maybe someone would've noticed if it caused
obviously incorrect behavior, rather than just sitting there being
wrong, or maybe nobody cares about metadata. ;)

Anyway:
* Kind of silly for every page on the site to make statements about
the CC home page
* The parser you're using seems to be appending / after the hostname,
but I'm not sure if that can be counted on -- no trailing / is
specified in the page, which means the subject will never match the
referer, even when clicking from the home page, as the referer from
the home page will always be http[s]://creativecommons.org/ (note
trailing slash)
* The CC homepage isn't the attribution URL most useful to users --
providing it won't get directly back to the material of interest,
unless that just happens to be the CC homepage. (This applies to the
CC NL attributionURL above as well.)
* "this site" as the attribution name is plain silly
* Fixes could include removing the about property, changing
attributionURL to "" (ie current page), and rewording so that
attributionName can be "Creative Commons", or just skip that
annotation
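
To illustrate the trailing-slash point in the list above: a plain string comparison between a subject without a trailing slash and the referer sent from the home page fails, while comparing after normalizing an empty path to "/" succeeds. This is a minimal sketch; the function names are invented here and nothing is assumed about how the real scraper compares URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Treat an empty path as "/", lowercase the host, and drop any fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def subject_matches_referer(subject, referer):
    """Compare an RDFa subject with a Referer value after normalization.

    The scheme is still compared as-is, so http and https do not match.
    """
    return normalize(subject) == normalize(referer)


subject = "http://creativecommons.org"    # no trailing slash, as in the page
referer = "http://creativecommons.org/"   # what the browser sends
print(subject == referer)                         # naive string comparison
print(subject_matches_referer(subject, referer))  # comparison after normalization
```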

Mike

Mr. Puneet Kishor

2013-09-28 00:20:21 UTC

Post by Mike Linksvayer
or maybe nobody cares about metadata. ;)

s/maybe nobody/almost maybe nobody/

Sad but probably true. On the other hand, having it right is still important so that the minority that *does* care about the metadata is given the right information.

--
Puneet Kishor
Science and Data at Creative Commons

Mike Linksvayer

2013-09-28 00:24:26 UTC

On Fri, Sep 27, 2013 at 5:20 PM, Mr. Puneet Kishor

Post by Mr. Puneet Kishor

Post by Mike Linksvayer
or maybe nobody cares about metadata. ;)

s/maybe nobody/almost maybe nobody/
Sad but probably true. On the other hand, having it right is still important so that the minority that *does* care about the metadata is given the right information.

There seems to be some missing information here in order to draw that conclusion.

But I write to say I made a mistake -- since September 2011, not 2012.

Mike

Maarten Zeinstra

2013-09-30 13:15:09 UTC

Hi Mike,

Thanks for the insights, I didn't realise https doesn't send referrers. Seems logical though.

I have now linked to the https versions of the licenses. It was interesting that a user only saw this now, after years of the links being like that. Probably they don't care, like Puneet says.

It seems like you are proposing a good solution; however, I would first like to see how many times we enrich the deed pages per month, to see if it is being used at all. I hardly ever see an enriched page, mainly because I recently tightened my browser's privacy with Ghostery, HTTPSEverywhere and AdBlockPlus. If many users do this, then this whole metadata scraper idea is dead.

I don't know if I totally agree with your statement that creativecommons.org or .nl is a bad attributionURL. If they are reusing the work, then the work itself is visible in its reuse and the original context might not matter. Do you think an attributionURL should be the same as a source URL?

Cheers,

Maarten

Mike Linksvayer

2013-09-30 18:53:46 UTC

Post by Maarten Zeinstra
Thanks for the insights, I didn't realise https doesn't send referrers. Seems logical though.

Browsers aren't supposed to send a referrer where the link is on a
secure page and the target is an insecure page.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec15.html#sec15.1.3

But sending a referrer is always at the option of the client, and in
my experience, referrer isn't sent going from insecure->secure either.
I don't guarantee this will always work. :)
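
For reference, the rule in the RFC section linked above is that clients SHOULD NOT include a Referer header in a non-secure HTTP request if the referring page was transferred with a secure protocol. A minimal sketch of just that rule follows; the function name is made up, and real clients may withhold the header in more cases than this, as noted above.

```python
from urllib.parse import urlsplit


def rfc2616_allows_referer(from_url, to_url):
    """Return False only for the case RFC 2616 section 15.1.3 discourages.

    That case is a link on a secure (https) page pointing to a non-secure
    (http) target. A True result does not mean a client will actually send
    the header; sending it is always at the client's option.
    """
    source_secure = urlsplit(from_url).scheme == "https"
    target_secure = urlsplit(to_url).scheme == "https"
    return not (source_secure and not target_secure)


# secure page linking to non-secure deeds: header should be withheld
print(rfc2616_allows_referer("https://creativecommons.org/licenses/",
                             "http://creativecommons.org/licenses/by/3.0/"))
# non-secure page linking to secure deeds: the RFC rule itself does not forbid it
print(rfc2616_allows_referer("http://creativecommons.nl/uitleg/",
                             "https://creativecommons.org/licenses/by/3.0/nl/"))
```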

Post by Maarten Zeinstra
I have now linked to the https versions of the licenses. It was interesting that a user only saw this now, after years of the links being like that. Probably they don't care, like Puneet says.
It seems like you are proposing a good solution; however, I would first like to see how many times we enrich the deed pages per month, to see if it is being used at all. I hardly ever see an enriched page, mainly because I recently tightened my browser's privacy with Ghostery, HTTPSEverywhere and AdBlockPlus. If many users do this, then this whole metadata scraper idea is dead.

HTTPSEverywhere generally defeats the scraper, as most license links
are to http://creativecommons... and HTTPSEverywhere either causes the
referrer to be dropped and/or confuses the scraper. Should be possible
to mitigate this by always using https for license deeds, including
providing https urls for links. CC should probably do this.

I don't know that AdBlockPlus does anything with referrer; Ghostery
may. I haven't used it in a long time, in favor of
https://disconnect.me/, and I admit I haven't checked whether that
does anything with referrer.

Another problem is that the scraper will probably miss anything using
modern RDFa (1.1 Lite), which is also a bit less fragile due to at
least not requiring a namespace declaration for common CC usecases. If
the scraper is useful at all it really ought to be updated to support
this. Same for the HTML provided with the chooser and documentation.
And I think it makes sense to be neutral about formats and also look
for microdata and microformats annotations.

Those two things (https deed urls, rdfa 1.1 lite & co
support/publishing/documentation) I expect would make the probably
tiny fraction of deeds enriched go up a bit, but more important and
complementary is getting more large sites/widely used software to
publish and consume the annotations. For example Flickr did (may
still) add some RDFa to photo pages, but it was always somewhat
broken. On the consumption side, which is more important IMO, the deed
scraper is it; the intention (again from my perspective) was to close
the loop by introducing a consumer, any consumer, so that the annotations had
*some* visibility, hopefully spurring more (but that spurring requires
a lot more finishing, documentation, evangelism that we never got to
for the most part). I haven't followed it closely at all, but maybe
some of Jonas Oberg's work will push in that direction, whether it is
ever reflected in the CC deeds or not.

There was at least discussion several years ago of logging
scraper-scraped metadata so that we could analyze its usage. I don't
remember whether that was set up, but certainly the analysis was never
done. That'd be another thing that could be done, if CC wanted to.

Some info can also be gleaned by crawling the web, or analyzing
others' crawls. I took a look at some low-hanging fruit in that regard
awhile back, and it didn't look great ...
http://gondwanaland.com/mlog/2012/01/23/attribution-crawl/

Post by Maarten Zeinstra
I don't know if I totally agree with your statement that creativecommons.org or .nl is a bad attributionURL. If they are reusing the work, then the work itself is visible in its reuse and the original context might not matter. Do you think an attributionURL should be the same as a source URL?

Yes. Consider how much less useful the web would be if you could only
link to a site, not a page within a site. Practice of linking to the
homepage of a site that a resource is on rather than the resource
itself is crippling as an attribution url in exactly that way.

If that's too handwavy, consider that you remixed one of my images,
and link to my homepage as attribution. The intent of the license (I
would have used CC0, but generic "I"...) is that the third party can
take advantage of the license offered by me in the original work. If
they have to dig around on my site to find the work instead of
directly going to it, this advantage is substantially diminished.

Mike

Maarten Zeinstra

2013-10-01 12:58:23 UTC

Hi Mike,

Thanks for the extensive reply.

It indeed looks like the scraper needs a thorough rebuild to be properly usable on today's internet. One way to bypass this is to create a tool that works on the licensor's page, not on the deed page. That way we would be able to use the info on that page and not look at referrers, which you could argue is something Creative Commons should perhaps never have done in the first place.

You are right about the attributionURL, but from a practical standpoint it makes it more difficult to put a license in the footer of your page, because now I need to add logic instead of a static line of code.

Cheers,

Maarten

Mike Linksvayer

2013-10-01 17:18:15 UTC

Post by Maarten Zeinstra
It indeed looks like the scraper needs a thorough rebuild to be properly usable on today's internet. One way to bypass this is to create a tool that works on the licensor's page, not on the deed page. That way we would be able to use the info on that page and not look at referrers.

Yes, that's the idea behind client tools, the latest iteration of
which, I guess, was OpenAttribute. These days, or rather for many years
now, JavaScript that the publisher could include on their site would be nice.

Post by Maarten Zeinstra
Which you could argue is something Creative Commons should perhaps never have done in the first place.

I did, long ago, and eventually gave up. :-\

Post by Maarten Zeinstra
You are right about the attributionURL, but from a practical standpoint it makes it more difficult to put a license in the footer of your page, because now I need to add logic instead of a static line of code.

attributionURL=""

Empty string is the current page. It does work with the scraper.

This is my favorite kind of enhancement -- gained by only deletion.

Mike

Maarten Zeinstra

2013-10-07 13:29:00 UTC

Hi Mike,

Ah yes, there just doesn't seem to be any interest in or funding for this development and the advocacy around it.

You say that you should simply point attributionURL to "", but that is not so simple with the examples I found. I now use:

<a xmlns:cc="http://creativecommons.org/ns#" href="http://www.creativecommons.nl/" property="cc:attributionName" rel="cc:attributionURL">
Creative Commons Nederland</a>

Should I change that into

<span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">
Creative Commons Nederland</span>

or should I do something else? I tried this today but the metadata scraper does not seem to work today.

Cheers,

Maarten

Mike Linksvayer

2013-10-07 17:17:29 UTC

Sorry, what I wrote was misleading, looked like syntax but isn't.

Post by Maarten Zeinstra
<a xmlns:cc="http://creativecommons.org/ns#" href="http://www.creativecommons.nl/" property="cc:attributionName" rel="cc:attributionURL">
Creative Commons Nederland</a>
Should I change that into
<span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">
Creative Commons Nederland</span>
or should I do something else? I tried this today but the metadata scraper does not seem to work today.

Something else:

<a xmlns:cc="http://creativecommons.org/ns#" href=""
property="cc:attributionName" rel="cc:attributionURL">Creative Commons
Nederland</a>
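
The reason the empty href works is ordinary relative-URL resolution: an empty reference resolves to the URL of the page it appears on, so the same static footer markup yields a different attribution URL on every page. A small sketch with Python's standard library; the helper name is invented for this example and the scraper's own resolution code is not shown here.

```python
from urllib.parse import urljoin


def resolve_attribution_url(page_url, href):
    """Resolve an href found on page_url the way a relative link is resolved."""
    return urljoin(page_url, href)


# empty href: resolves to the page the markup appears on
print(resolve_attribution_url("http://creativecommons.nl/uitleg/", ""))
# absolute href: every page points at the same site-wide URL
print(resolve_attribution_url("http://creativecommons.nl/uitleg/",
                              "http://www.creativecommons.nl/"))
```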

Mike

Dan Mills

2013-10-07 19:21:19 UTC

Post by Maarten Zeinstra
Hi Mike,
Thanks for the extensive reply.
It indeed looks like the scraper needs a thorough rebuild to be properly usable on today's internet. One way to bypass this is to create a tool that works on the licensor's page, not on the deed page. That way we would be able to use the info on that page and not look at referrers, which you could argue is something Creative Commons should perhaps never have done in the first place.

Tools that run on the licensor's page are one piece of the puzzle.
Thanish, one of this year's GSoC interns, mocked these up a couple of
months ago:

http://mnmtanish.github.io/cc-attribution-helper/widget.designs/widget.old/single.html
http://mnmtanish.github.io/cc-attribution-helper/widget.designs/widget.fullscreen/index.html

Icon positioning is off on the 2nd one, should be halfway on the
image/halfway sticking out. But should give you an idea anyway.

I stopped focusing on those for the time being, for a few reasons
including that it's hard to imagine how it would work for a
non-technical user in isolation of other services (like: a works DB
where the details can be maintained, software to automatically
recognize/track content, etc).

So I'm currently focusing on a specific vertical (k-12 education), and
how they remix content. In doing so, I can build a specific flavor of
attribution as it matters to those users. See the storyboard at the
end of this page:

http://wiki.creativecommons.org/Products/Pasteboard

Not sure what the attribution block would look like, that's just a
rough sketch. Full metadata details would be elsewhere (linked to at
the bottom of the block).

In any case, it's abundantly clear to me that the scraper approach has
outlived its usefulness. Very few people add the required metadata
into their pages, SSL is now becoming widespread, and we need to think
about non-desktop app platforms as well[1], where the scraper is
meaningless. We should deprecate the scraper and move on.

Dan

[1]: http://www.wirfs-brock.com/allen/posts/490