Web Excursions 2022-09-05

Platy Hsu

Sep 06, 2022

Blocking Kiwifarms

This is an extraordinary decision for us to make and,
- given Cloudflare's role as an Internet infrastructure provider, a dangerous one that we are not comfortable with.
However, the rhetoric on the Kiwifarms site and specific, targeted threats have escalated over the last 48 hours
- to the point that we believe there is an unprecedented emergency and immediate threat to human life
- unlike we have previously seen from Kiwifarms or any other customer before.
- Revolting content alone does not create an emergency situation that necessitates the action we are taking today.
Beginning approximately two weeks ago,
- a pressure campaign started with the goal to deplatform Kiwifarms.
- That pressure campaign targeted Cloudflare as well as other providers utilized by the site.
We have never been their hosting provider.
- In a law-respecting world, the answer to even illegal content is not to use other illegal means like DDoS attacks to silence it.
- We are also not taking this action directly because of the pressure campaign.
As the pressure campaign escalated, so did the rhetoric on the Kiwifarms site.
- Feeling attacked, users of the site became even more aggressive.
- Over the last two weeks, we have proactively reached out to law enforcement in multiple jurisdictions
- unfortunately the process is moving more slowly than the escalating risk.
- the imminent and emergency threat to human life which continues to escalate causes us to take this action.
Hard cases make bad law.
- This is a hard case and we would caution anyone from seeing it as setting precedent.
Unfortunately, that mechanism does not exist and so we are making this uncomfortable emergency decision alone.
- we are aware and concerned that our action may only fan the flames of this emergency.
There is real risk that by taking this action today we may have further heightened the emergency.
- it by no means solves the underlying problem.
- That solution will require much more work across society.

Discontinuing Bibliogram

The origin

I started Bibliogram in early 2020.
A much-requested feature I added early on was RSS feeds.
- This ended up getting quickly turned off for the main instance, because RSS usage was dwarfing interactive usage.
- Many of these feeds had been added to people's readers and forgotten about.
Feed requests aren't free.
- Bibliogram needs to make an outgoing web request, wait for it, and convert the response data.
- This also uses up a piece of Bibliogram's rate limit to Instagram, even if nobody's there to see the feed that Bibliogram generates.

The rate limit

Instagram rate-limits access to its servers to stop people from doing the exact thing I'm doing.
There was a parameter in profile pages called rhx_gis, and your application needed to remember it so that it could use it in subsequent requests.
- If you used the wrong rhx_gis, you were locked out.
- Instagram used this in the past, but didn't require it when I started working on Bibliogram in January 2020.

January 2020: main profile page

After 100 or so requests to profile pages like instagram.com/rkrkrk, Instagram would stop returning a useful response until you cooled down.
- Timeline continuations weren't limited,
  - but you could only access the timeline if you knew the internal user ID, and
  - you could only get that ID from the profile page.
- So if you'd accessed a profile page in the past,
  - you could store the ID and you only needed to access the timeline continuation from then on,
  - which wasn't limited.
Currently, the limit on requests to profile pages is way less than 100.
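
The store-the-ID trick described above can be sketched roughly as follows. This is a minimal illustration, not Bibliogram's actual code: `UserIdCache`, `fetch_profile_id`, and the ID value are hypothetical names and placeholder data made up for the example.

```python
from typing import Callable, Dict


class UserIdCache:
    """Remember username -> internal user ID so the rate-limited profile
    page is only requested the first time a username is seen."""

    def __init__(self, fetch_profile_id: Callable[[str], str]) -> None:
        # fetch_profile_id is a hypothetical stand-in for the rate-limited
        # profile page request; it is not a real Instagram or Bibliogram API.
        self._fetch_profile_id = fetch_profile_id
        self._ids: Dict[str, str] = {}
        self.profile_requests = 0  # how many limited requests were spent

    def user_id(self, username: str) -> str:
        if username not in self._ids:
            self.profile_requests += 1
            self._ids[username] = self._fetch_profile_id(username)
        return self._ids[username]


def fake_profile_lookup(username: str) -> str:
    # Placeholder data for the example only.
    return {"rkrkrk": "1001"}[username]


cache = UserIdCache(fake_profile_lookup)
first = cache.user_id("rkrkrk")   # spends one profile page request
second = cache.user_id("rkrkrk")  # served from the stored mapping
```

Only the first lookup touches the limited endpoint; every later visit can go straight to the timeline continuation with the stored ID.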

June 2020: profile page blocked for servers

You can now only access a profile page if you're in somebody's house in real life — so not if you're a server on the internet.
There were a few ways of working around this:
- For finding user IDs, the assistants feature was added.
  - Trusted people could run the assistant program at home, which would collect user IDs (and nothing else) on behalf of Bibliogram, and Bibliogram instances could share between each other all the user IDs they already knew about.
- The import script copied all user ID mappings from one instance to another.
- A browser userscript that people could install let them access a specific user ID.
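
As a rough sketch of what copying user ID mappings between instances involves, assume each instance keeps a simple username-to-ID dictionary. The function name `import_user_ids` and the sample mappings are hypothetical and are not taken from Bibliogram's import script.

```python
from typing import Dict


def import_user_ids(local: Dict[str, str], remote: Dict[str, str]) -> int:
    """Copy username -> user ID mappings from another instance.

    Existing local entries are kept as-is; only unknown usernames are added.
    Returns the number of newly learned mappings.
    """
    added = 0
    for username, user_id in remote.items():
        if username not in local:
            local[username] = user_id
            added += 1
    return added


# Placeholder mappings for illustration only.
instance_a = {"rkrkrk": "1001"}
instance_b = {"rkrkrk": "1001", "example_user": "2002"}
added = import_user_ids(instance_a, instance_b)
```

Each mapping learned this way is one profile page request that the receiving instance never has to make.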

July 2020: /feed/ bypass

There is a URL like instagram.com/rkrkrk/feed/, which is just like a regular profile URL but with /feed/ on the end, and it is not blocked for servers.
I put /feed/ into the Bibliogram code and all is well.

December 2020: /feed/ blocked for some servers

Late January 2021: graphql mostly blocked

Each graphql request has a different set of rules based on the matrix of whether you're at home or a cloud server or accessing via Tor, and which query_hash you're accessing.
- There are about 4 different behaviours that are fixed for a particular location-endpoint combination, but are seemingly assigned randomly.
After fixing this, I guess nothing really happened for a while.

July 2022: overhaul

Instagram radically changed the way it internally arranges the data in its pages, requiring new ways to make requests and new ways to parse through it.
For the profile page, there are 4 different ways that the data might be provided, and your extractor has to handle all 4.
It seems to switch which format is being used every few hours.
- If it's not the right time, then the extractor you're using will fail.
Here are the formats:
- iweb, pass in the username, get user object json.
  - rate limit maybe 50 or 100 per hour?
- instagram.com/rkrkrk/?__a=1 ajax after original page load, tiny tiny tiny rate limit.
- https://www.instagram.com/rkrkrk and extract _sharedData, similarly tiny tiny tiny rate limit.
  - You can try the /feed/ workaround, which used to give more requests, but this appears to finally be patched now.
  - /feed/ might only work from specific classes of IP addresses.
- instagram.com/rkrkrk or /feed/ and extract PolarisQueryPreloaderCache.
While it is still possible to write code to handle these methods and switch between them,
- some of them are rate limited too heavily to make Bibliogram viable at those times.
- Tor seems to be restricted further, though not completely.
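
Handling all four formats and switching between them amounts to a fallback chain: try one extractor, and move to the next if it fails. The sketch below assumes each method can be wrapped as a function that either returns data, returns nothing, or raises. The extractor functions here are hypothetical stubs that make no web requests and do not reflect Instagram's real responses.

```python
from typing import Callable, Dict, List, Optional, Tuple

Extractor = Callable[[str], Optional[Dict[str, str]]]


def extract_profile(
    username: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[str, Dict[str, str]]:
    """Try each extractor in order; return (method name, data) for the
    first one that yields data."""
    failures = []
    for name, extractor in extractors:
        try:
            data = extractor(username)
        except Exception as error:  # e.g. rate limited, unexpected layout
            failures.append(f"{name}: {error}")
            continue
        if data is not None:
            return name, data
        failures.append(f"{name}: no data in response")
    raise RuntimeError("all extractors failed: " + "; ".join(failures))


# Hypothetical stubs standing in for the four formats listed above.
def iweb(username: str) -> Optional[Dict[str, str]]:
    raise RuntimeError("rate limited")


def ajax(username: str) -> Optional[Dict[str, str]]:
    return None  # this format is not being served right now


def shared_data(username: str) -> Optional[Dict[str, str]]:
    return None


def preloader_cache(username: str) -> Optional[Dict[str, str]]:
    return {"username": username}


method, data = extract_profile(
    "rkrkrk",
    [
        ("iweb", iweb),
        ("ajax", ajax),
        ("shared_data", shared_data),
        ("preloader_cache", preloader_cache),
    ],
)
```

A chain like this keeps working when the format switches, but every failed attempt may still spend part of a tiny rate limit, which is the viability problem described above.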

Bots accessing Bibliogram

The bots were designed specifically to scrape data from Bibliogram,
- and the owners were apparently too lazy to run their own instance of Bibliogram, or to contact me asking for help setting one up.
In August 2020 I blocked various proxy networks from accessing my site, then dealt with the really bad offenders on a case-by-case basis.
Later on, I'd create a system where Bibliogram dynamically applies its own rate-limiting system to anyone accessing it.
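
The post does not say how Bibliogram's own rate limiting works, so the following is only one common approach, a per-client sliding window, shown to make the idea concrete. The class name, the limits, and the client address (from a documentation-only IP range) are assumptions made for the example.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        # now can be injected for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[client]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
# Five requests in quick succession from the same client.
results = [limiter.allow("203.0.113.7", now=float(t)) for t in range(5)]
# A request after the earliest hits have aged out of the window.
later = limiter.allow("203.0.113.7", now=61.0)
```

Refused requests are not recorded, so a client that slows down regains access as soon as its older requests age out of the window.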
I think a mistake I made here was the faceless approach to blocking people who are being a problem.
- They treat being blocked as a puzzle to overcome, and Bibliogram as just another faceless website,
- rather than something being run by real people who want to help.

Why is Bibliogram discontinued?

The simultaneous crackdown on /feed/ and Tor, along with needing to write new code to scrape the page,
- is too much for me to bother with,
- especially when I am working on it in my spare time and have no personal interest or incentive.

What does discontinued mean?

You can't look at profiles.
- You can still look at individual posts, but if this breaks in the future, I probably won't fix it.
The main instance, bibliogram.art, will shut down unless somebody offers to take over running it.
More Instagram workarounds are definitely possible due to its code still being dogshit, but I don't have the energy to look for more workarounds myself.
