jojoleprowebsite

[Done] The Sauce of https://jojolepro.com
git clone https://git.jojolepro.com/jojoleprowebsite.git
Log | Files | Refs | README | LICENSE

commit d925faf613cd6c261556730a353207059c26c90b
parent e43a0b8977e5fa20c6b7c3eb8ea7059a2f2998b0
Author: Joël Lupien (Jojolepro) <jojolepro@jojolepro.com>
Date:   Thu,  9 Apr 2020 21:23:22 -0400

first build

Diffstat:
Mbuild.sh | 37++++++++++++++++++++++++++++++++++++-
Abuild/blog/2020-03-31_extracting_data_from_websites/curl.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/image_element.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/index.html | 299+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Abuild/blog/2020-03-31_extracting_data_from_websites/index.txt | 233+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Abuild/blog/2020-03-31_extracting_data_from_websites/network_sources.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/network_sources2.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/network_sources3.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/preloaded.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/response.png | 0
Abuild/blog/2020-03-31_extracting_data_from_websites/website.png | 0
Abuild/index.html | 67+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Abuild/index.txt | 1+
Asrc/blog/2020-03-31_extracting_data_from_websites/curl.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/image_element.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/index.txt | 233+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Asrc/blog/2020-03-31_extracting_data_from_websites/network_sources.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/network_sources2.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/network_sources3.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/preloaded.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/response.png | 0
Asrc/blog/2020-03-31_extracting_data_from_websites/website.png | 0
Asrc/index.txt | 1+
Mtemplate.html | 4+---
24 files changed, 871 insertions(+), 4 deletions(-)

diff --git a/build.sh b/build.sh @@ -1,5 +1,40 @@ -#!/bin/sh -e +#!/bin/sh rm -r ./build mkdir -p ./build +cp ./template.html /tmp/template.html +cd ./build +(cd ../src; find . -type f -a -not -path \*/\.\* -a -not -path ./templates/\*) | + +while read -r page; do + mkdir -p "${page%/*}" + + case $page in + *.txt) + sed -E "s|([^=][^\'\"])(https[:]//[^ )]*)|\1<a href='\2'>\2</a>|g" \ + "../src/$page" | + + sed -E "s|^(https[:]//[^ )]{50})([^ )]*)|<a href='\0'>\1</a>|g" | + + sed -E "s/^\.\/.*\.(png|jpg|svg)/<img src='\0'\/>/g" | + + sed '/%%CONTENT%%/r /dev/stdin' /tmp/template.html | + sed '/%%CONTENT%%/d' | + + sed "s %%SOURCE%% /${page##./} " \ + > "${page%%.txt}.html" + + ln -f "../src/$page" "$page" + + printf '%s\n' "CC $page" + ;; + + # Copy over any images or non-txt files. + *) + cp "../src/$page" "$page" + + printf '%s\n' "CP $page" + ;; + esac +done diff --git a/build/blog/2020-03-31_extracting_data_from_websites/curl.png b/build/blog/2020-03-31_extracting_data_from_websites/curl.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/image_element.png b/build/blog/2020-03-31_extracting_data_from_websites/image_element.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/index.html b/build/blog/2020-03-31_extracting_data_from_websites/index.html @@ -0,0 +1,299 @@ +<!doctype html> +<html lang=en> + <head> + <link href='data:image/png;base64,R0lGODlhGAAYAOeIAAAAALVSwLlUxLxXyLlZxMlb1cpc1hScoRSepNBf3c9h29Rg4NJh39Nh4BWhps9j3NJi3s1k2dVh4tBk3NVi4tNj39Zi49di49di5NBl3NVj4tNk39hi5dZj49Zj5Nli5dRk4ddj5NVk4thj5dti5tNl4Nhj5tZk49Fm3tlj5dRl4Nlj5tdk49dk5Npj5ddk5dVl4dhk5dZl49lk5tdl5dpk5tZm49Nn5NVn4tVn5dZn49Ro4Nlm5tNo5ddn5NBq29Vo4dVo4haortln5dRp4Ndo49Zp49Nq5Raqr9pp59hr5NFu5Nls5s1y5BevtBawtcp25BizuRi1uhm4vi20wbqG4hm7wRq8whm9w7yI4raM4Bq/xRq/xrSO4RrByBrDyRvDyhvFyxvGzBvHzRvHzqeb4BvIzhvIzxzIzxvJz6ad4CTH0aKf3xzK0BzK0RzL0p6i3xzN1BzP1R3P1pGr3h3Q15Or3RzR2CPP2R3R2B3R2R/R2RzS2R3S2SfP2Yex3Yey3TPN2TbN2Xm43DDP2Xa63EXK2kbK2v///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////yH+B2hhaSBoYWkAIfkEAQoA/wAsAAAAABgAGAAACP4A/wkcSLCgwYMFASgEgLDhQiFWrkxZ2JAgACl58vTZuDGPwor/ALzhyFEMmI1uGCIEUKfPHABkODoAMGZjHZUJ02wEEAUARwBofNrEOVBoH5EAzuyMA0Djxi9EAXD5mWdOS5JYo2Ld+pNjGpUAEGwVmrLPlqMkwSLhCgAqAgAHjG40wxCAE656ADzBQwWAHz5ZQ2LhashOlR4uSOToAkjQz5BesB5ic4QHiyQyiFiokQJKoZ8ATnIctCQGBCYqRHjYoWPIihUvyhA6CiDMxj10aFhI8MBGhghBZnTg8fp1FkIKa/aBk2LFhQUlbOAoooHDiRbFX2tR2EbPnw+vR0NMMAJEAQwQFBpoCDFCgoCFAOQEapJ9Q5EKDFCUKPKDAPyid6hhQnZK+GDAAPARVdQaN7yGQQEB/AdSSAlOaOGFAgUEADs=' + rel=icon> + <title>Jojolepro.com</title> + <meta charset=utf-8> + <meta name=description content="A bunch of stuff and things!"> + <style> + body{ + text-align:center; + overflow-y:scroll; + /*font:calc(0.75em + 1vmin) monospace;*/ + font: 1.3em monospace; + background-color: #181a1b; + border-color: #575757; + color: #e8e6e3; + } + pre{ + text-align:left; + display:inline-block; + } + img{ + max-width:57ch; + display:block; + height:auto; + width:100%; + } + a { + color: #3391ff; + } + * { + scrollbar-color: #2a2c2e #1c1e1f; + } + /*@media(prefers-color-scheme:dark){ + body{ + background:#000; + color:#fff; + } + a{ + color:#6CF; + } + }*/ + </style> + </head> + <body> + <nav> + <a href=/><b>Jojolepro</b></a> + <br/> + <a href=/blog>Blog </a> + <a href=https://github.com/jojolepro/>GitHub</a> + </nav> + <br/> + <article> + <pre> +Extracting Data From A Website +================================================================================ + +Sometimes, you want to get the data or images that is visible on your screen +while visiting a websites, and its not too clear how you can get it +automatically. + +Here, I will show you a general process of doing just that. +I will first explain the important concepts and illustrate them using a real +life example. + +If you know how websites work, you can skip the next section. + +Web Pages Basics +=============================================================================== + +Let's first discuss how web pages get filled with data, being images or +text/numbers. +There is usually two locations where this happens: +On the server or on the client. + +Filled Server Side +-------------------------------------------------------------------------------- + +This is what older website usually do. +A webpage is usually composed of 4 things: +- Html (.html): The layout of the page and the text. +- Css (.css): The style (we don't care about this for now.) +- Javascript (.js): Executable code that will change the page. +- Other ressources: Images, files, etc. + +When requesting a web page, your web browser asks the web server to give it an +html file. The server takes the local html file, inserts the data into it +(being it text, numbers or paths to images) and then sends it to you. + +In this case, getting the data is easy. More on this a bit lower. + +Filled Client Side +-------------------------------------------------------------------------------- + +Now, new websites usually want to be more dynamic. That is, they want to load +content into the page without completely requesting a new html file. +They do this by using Javascript. Here's what happens: +- Your browser asks the server for the html file. +- The server gives you the html file without filling in the data. +- Your browser reads the html and executes the javascript code that it finds + (or it downloads it first, if it is on the server.) +- The javascript asks the server for the data, when it is needed. +- The server gives the data. +- The javascript inserts the data into the html. + +As you can see, it is definitely a bit more complex here. +The good news is that with a bit of intuition, we can pretty easily see what +the javascript asks to the server. In this situation, we will pretend to be the +javascript getting data. + +Identifying What We Want +================================================================================ + +No matter the way the page was made, we first need to identify what we want. +As an example to illustrate what we will be doing, I will be extracting weather +radar data from a dynamic government website. + +https://weather.gc.ca/radar/index_e.html?id=WUJ + +I bet the link will eventually cease to work, but in any case, you can follow +along with any website in the next sections in case this one doesn't work +anymore. + +<img src='./website.png'/> + +Here, when we click the play button, the website will play an animation with +(usually) 7 pictures. A new picture is created every 10 minutes and the 7th is +deleted. +Now, what if we wanted to create an animation of the weather for the past day? +Since they don't provide an easy way to get old pictures, we will have to +download them for ourselves as time goes on. + +Let's start by getting an understanding of how those are stored in the html. +Assuming you have access to developer tools (I use Firefox as my web browser, +but it should work on Chrome and others), you can right click the picture and +click "Inspect Element". + +You should now see that there is quite a few "img" elements in there. In most +cases, the one selected is the one you want. If it is not, you can +find the one you want by right clicking on the different "src=" fields and +clicking "Open link in a new tab". +Sometimes, like in this case, the image I want is actually "deeper" in the +hierarchy. +That is, more to the right and inside another element. +In our case, it is the image inside the "div id=animation-image" element. + +<img src='./image_element.png'/> + +Now this is weird, there is only one picture in there. If you click the play +button of the animation however, you will see that a new "div" appears right +under our image in the inspector. Let's open it! + +<img src='./preloaded.png'/> + +Ah, here's the 7 images that play in the animator. +What's happening here is that this website only loads the images once you start +playing the animation. This is often what dynamic websites do; they will load +data (like images) only when you need to see them. + +Server-Filled +-------------------------------------------------------------------------------- +In the case of a web paged filled on the server, you will already have the +image in there as soon as the page is first loaded. If this is your case, you +can simply use something like the following to extract the image source: + +$ wget "https://blablabla" | pup 'img#the-id-of-the-image attr{src}' + +(you can find pup here: <a href='https://github.com/EricChiang/pup'>https://github.com/EricChiang/pup</a>) + +If your web page is not like this, continue reading. + +Understanding The Pattern +================================================================================ + +Here, we see that the source (src) of each image is composed of a name, a date +and a time. +We can guess that there are two ways those names got there. Either the +javascript we are executing knows the pattern (use time increments of 10 +minutes) or, the more likely option, the server simply told us the names when +the javascript asked for them. + +We will assume that the javascript asked for those names, as it is more common. + +Identifying Our Sources +================================================================================ + +Let's switch to the "Network" tab of the developer tools and then refresh the +page. +You should see something like this: + +<img src='./network_sources.png'/> + +Each of this is a request that our browser did to the server. That's a lot! +Let's scroll all the way up. + +<img src='./network_sources2.png'/> + +We see that our first request is for the html file. +Then we get our css style files and also javascript files. +The css files are not interesting to us and usually the javascript files won't +be either. You can see the types of files in the "Type" column. + +We will skip all the css and js files. +Since we determined in the last section that our website is filled dynamically, +we can also ignore the image file types. +Let's look at what remains + +<img src='./network_sources3.png'/> + +Ahhhh, now it is way easier to see what is going on. +We have our html page, some "retrieve" file and a site menu. +Let's click on the retrieve file and in the right tab, open the "Response" tab. + +<img src='./response.png'/> + +Here! All our image sources are now available to us! +Now, let's right click the file, open in a new tab aaaaaaand... nothing. +This is something that will often happen with requests for data. The server +will usually expect some data that is not there when opening only using the +file source. + +Go back into the network tab and right click on the file. Inside of the "Copy" +section, select "Copy as cURL". We will now open a terminal and copy what we +got inside. Magically, all our data is now here! + +<img src='./curl.png'/> + +Simplyfing Our Request +================================================================================ + +As you can see, your request is really long and complex. By trial and error, we +can remove sections of it. + +Here, the following request works: + +$ curl "https://weather.gc.ca/radar/xhr.php?action=retrieve&target=images"\ +"&region=WUJ&product=PRECIP_SNOW&lang=en-CA&format=json" \ +-H 'X-Requested-With:XMLHttpRequest' > /tmp/data.txt + +Getting Our Sources +================================================================================ + +Our query works, now it is time to get the sources for those images. +Let's use `jq`, a tool that will make our life much easier to extract the image +links from this mess. +More details on jq can be found here: <a href='https://stedolan.github.io/jq/'>https://stedolan.github.io/jq/</a> + +This command will extract the urls from the big text we have. +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_20.GIF +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_30.GIF +... + +Since those aren't actually urls (they are missing <a href='https://....'>https://....</a>), we will add +those. + +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' | sed \ +'s/^/https:\/\/weather.gc.ca/' > /tmp/urls.txt + +If we look at our file, we see that everything looks good: + +$ cat /tmp/urls.txt +<a href='https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_...'>https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECI</a> +<a href='https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_...'>https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECI</a> +.... + +and finally, we can download our images! + +cd /tmp +wget -i /tmp/urls.txt + +Conclusion +================================================================================ + +While I tried to make this as simple as possible, this is not a simple topic. +It requires knowledge of the web and of the linux tools. + +This should be viewed more as an overview of the concepts than as a full +explanation of every tool we used. If this interests you, I would suggest you +go through all the tools and concepts we touched on to get and idea of what +everything is or does. Notably: +- Server sided data insertion (See PHP) +- Client sided data requests (See Javascript/jQuery) +- Linux cat/pipes/output redirection +- The jq command +- The wget and curl commands + + </pre> + </article> + <br/> + <footer> + <div> + (C) Joël Lupien 2020-2020 + </div> + <a href="/blog/2020-03-31_extracting_data_from_websites/index.txt">View page source</a> + </footer> + </body> +</html> diff --git a/build/blog/2020-03-31_extracting_data_from_websites/index.txt b/build/blog/2020-03-31_extracting_data_from_websites/index.txt @@ -0,0 +1,233 @@ +Extracting Data From A Website +================================================================================ + +Sometimes, you want to get the data or images that is visible on your screen +while visiting a websites, and its not too clear how you can get it +automatically. + +Here, I will show you a general process of doing just that. +I will first explain the important concepts and illustrate them using a real +life example. + +If you know how websites work, you can skip the next section. + +Web Pages Basics +=============================================================================== + +Let's first discuss how web pages get filled with data, being images or +text/numbers. +There is usually two locations where this happens: +On the server or on the client. + +Filled Server Side +-------------------------------------------------------------------------------- + +This is what older website usually do. +A webpage is usually composed of 4 things: +- Html (.html): The layout of the page and the text. +- Css (.css): The style (we don't care about this for now.) +- Javascript (.js): Executable code that will change the page. +- Other ressources: Images, files, etc. + +When requesting a web page, your web browser asks the web server to give it an +html file. The server takes the local html file, inserts the data into it +(being it text, numbers or paths to images) and then sends it to you. + +In this case, getting the data is easy. More on this a bit lower. + +Filled Client Side +-------------------------------------------------------------------------------- + +Now, new websites usually want to be more dynamic. That is, they want to load +content into the page without completely requesting a new html file. +They do this by using Javascript. Here's what happens: +- Your browser asks the server for the html file. +- The server gives you the html file without filling in the data. +- Your browser reads the html and executes the javascript code that it finds + (or it downloads it first, if it is on the server.) +- The javascript asks the server for the data, when it is needed. +- The server gives the data. +- The javascript inserts the data into the html. + +As you can see, it is definitely a bit more complex here. +The good news is that with a bit of intuition, we can pretty easily see what +the javascript asks to the server. In this situation, we will pretend to be the +javascript getting data. + +Identifying What We Want +================================================================================ + +No matter the way the page was made, we first need to identify what we want. +As an example to illustrate what we will be doing, I will be extracting weather +radar data from a dynamic government website. + +https://weather.gc.ca/radar/index_e.html?id=WUJ + +I bet the link will eventually cease to work, but in any case, you can follow +along with any website in the next sections in case this one doesn't work +anymore. + +./website.png + +Here, when we click the play button, the website will play an animation with +(usually) 7 pictures. A new picture is created every 10 minutes and the 7th is +deleted. +Now, what if we wanted to create an animation of the weather for the past day? +Since they don't provide an easy way to get old pictures, we will have to +download them for ourselves as time goes on. + +Let's start by getting an understanding of how those are stored in the html. +Assuming you have access to developer tools (I use Firefox as my web browser, +but it should work on Chrome and others), you can right click the picture and +click "Inspect Element". + +You should now see that there is quite a few "img" elements in there. In most +cases, the one selected is the one you want. If it is not, you can +find the one you want by right clicking on the different "src=" fields and +clicking "Open link in a new tab". +Sometimes, like in this case, the image I want is actually "deeper" in the +hierarchy. +That is, more to the right and inside another element. +In our case, it is the image inside the "div id=animation-image" element. + +./image_element.png + +Now this is weird, there is only one picture in there. If you click the play +button of the animation however, you will see that a new "div" appears right +under our image in the inspector. Let's open it! + +./preloaded.png + +Ah, here's the 7 images that play in the animator. +What's happening here is that this website only loads the images once you start +playing the animation. This is often what dynamic websites do; they will load +data (like images) only when you need to see them. + +Server-Filled +-------------------------------------------------------------------------------- +In the case of a web paged filled on the server, you will already have the +image in there as soon as the page is first loaded. If this is your case, you +can simply use something like the following to extract the image source: + +$ wget "https://blablabla" | pup 'img#the-id-of-the-image attr{src}' + +(you can find pup here: https://github.com/EricChiang/pup) + +If your web page is not like this, continue reading. + +Understanding The Pattern +================================================================================ + +Here, we see that the source (src) of each image is composed of a name, a date +and a time. +We can guess that there are two ways those names got there. Either the +javascript we are executing knows the pattern (use time increments of 10 +minutes) or, the more likely option, the server simply told us the names when +the javascript asked for them. + +We will assume that the javascript asked for those names, as it is more common. + +Identifying Our Sources +================================================================================ + +Let's switch to the "Network" tab of the developer tools and then refresh the +page. +You should see something like this: + +./network_sources.png + +Each of this is a request that our browser did to the server. That's a lot! +Let's scroll all the way up. + +./network_sources2.png + +We see that our first request is for the html file. +Then we get our css style files and also javascript files. +The css files are not interesting to us and usually the javascript files won't +be either. You can see the types of files in the "Type" column. + +We will skip all the css and js files. +Since we determined in the last section that our website is filled dynamically, +we can also ignore the image file types. +Let's look at what remains + +./network_sources3.png + +Ahhhh, now it is way easier to see what is going on. +We have our html page, some "retrieve" file and a site menu. +Let's click on the retrieve file and in the right tab, open the "Response" tab. + +./response.png + +Here! All our image sources are now available to us! +Now, let's right click the file, open in a new tab aaaaaaand... nothing. +This is something that will often happen with requests for data. The server +will usually expect some data that is not there when opening only using the +file source. + +Go back into the network tab and right click on the file. Inside of the "Copy" +section, select "Copy as cURL". We will now open a terminal and copy what we +got inside. Magically, all our data is now here! + +./curl.png + +Simplyfing Our Request +================================================================================ + +As you can see, your request is really long and complex. By trial and error, we +can remove sections of it. + +Here, the following request works: + +$ curl "https://weather.gc.ca/radar/xhr.php?action=retrieve&target=images"\ +"&region=WUJ&product=PRECIP_SNOW&lang=en-CA&format=json" \ +-H 'X-Requested-With:XMLHttpRequest' > /tmp/data.txt + +Getting Our Sources +================================================================================ + +Our query works, now it is time to get the sources for those images. +Let's use `jq`, a tool that will make our life much easier to extract the image +links from this mess. +More details on jq can be found here: https://stedolan.github.io/jq/ + +This command will extract the urls from the big text we have. +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_20.GIF +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_30.GIF +... + +Since those aren't actually urls (they are missing https://....), we will add +those. + +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' | sed \ +'s/^/https:\/\/weather.gc.ca/' > /tmp/urls.txt + +If we look at our file, we see that everything looks good: + +$ cat /tmp/urls.txt +https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_... +https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_... +.... + +and finally, we can download our images! + +cd /tmp +wget -i /tmp/urls.txt + +Conclusion +================================================================================ + +While I tried to make this as simple as possible, this is not a simple topic. +It requires knowledge of the web and of the linux tools. + +This should be viewed more as an overview of the concepts than as a full +explanation of every tool we used. If this interests you, I would suggest you +go through all the tools and concepts we touched on to get and idea of what +everything is or does. Notably: +- Server sided data insertion (See PHP) +- Client sided data requests (See Javascript/jQuery) +- Linux cat/pipes/output redirection +- The jq command +- The wget and curl commands + diff --git a/build/blog/2020-03-31_extracting_data_from_websites/network_sources.png b/build/blog/2020-03-31_extracting_data_from_websites/network_sources.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/network_sources2.png b/build/blog/2020-03-31_extracting_data_from_websites/network_sources2.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/network_sources3.png b/build/blog/2020-03-31_extracting_data_from_websites/network_sources3.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/preloaded.png b/build/blog/2020-03-31_extracting_data_from_websites/preloaded.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/response.png b/build/blog/2020-03-31_extracting_data_from_websites/response.png Binary files differ. diff --git a/build/blog/2020-03-31_extracting_data_from_websites/website.png b/build/blog/2020-03-31_extracting_data_from_websites/website.png Binary files differ. diff --git a/build/index.html b/build/index.html @@ -0,0 +1,67 @@ +<!doctype html> +<html lang=en> + <head> + <link href='data:image/png;base64,R0lGODlhGAAYAOeIAAAAALVSwLlUxLxXyLlZxMlb1cpc1hScoRSepNBf3c9h29Rg4NJh39Nh4BWhps9j3NJi3s1k2dVh4tBk3NVi4tNj39Zi49di49di5NBl3NVj4tNk39hi5dZj49Zj5Nli5dRk4ddj5NVk4thj5dti5tNl4Nhj5tZk49Fm3tlj5dRl4Nlj5tdk49dk5Npj5ddk5dVl4dhk5dZl49lk5tdl5dpk5tZm49Nn5NVn4tVn5dZn49Ro4Nlm5tNo5ddn5NBq29Vo4dVo4haortln5dRp4Ndo49Zp49Nq5Raqr9pp59hr5NFu5Nls5s1y5BevtBawtcp25BizuRi1uhm4vi20wbqG4hm7wRq8whm9w7yI4raM4Bq/xRq/xrSO4RrByBrDyRvDyhvFyxvGzBvHzRvHzqeb4BvIzhvIzxzIzxvJz6ad4CTH0aKf3xzK0BzK0RzL0p6i3xzN1BzP1R3P1pGr3h3Q15Or3RzR2CPP2R3R2B3R2R/R2RzS2R3S2SfP2Yex3Yey3TPN2TbN2Xm43DDP2Xa63EXK2kbK2v///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////yH+B2hhaSBoYWkAIfkEAQoA/wAsAAAAABgAGAAACP4A/wkcSLCgwYMFASgEgLDhQiFWrkxZ2JAgACl58vTZuDGPwor/ALzhyFEMmI1uGCIEUKfPHABkODoAMGZjHZUJ02wEEAUARwBofNrEOVBoH5EAzuyMA0Djxi9EAXD5mWdOS5JYo2Ld+pNjGpUAEGwVmrLPlqMkwSLhCgAqAgAHjG40wxCAE656ADzBQwWAHz5ZQ2LhashOlR4uSOToAkjQz5BesB5ic4QHiyQyiFiokQJKoZ8ATnIctCQGBCYqRHjYoWPIihUvyhA6CiDMxj10aFhI8MBGhghBZnTg8fp1FkIKa/aBk2LFhQUlbOAoooHDiRbFX2tR2EbPnw+vR0NMMAJEAQwQFBpoCDFCgoCFAOQEapJ9Q5EKDFCUKPKDAPyid6hhQnZK+GDAAPARVdQaN7yGQQEB/AdSSAlOaOGFAgUEADs=' + rel=icon> + <title>Jojolepro.com</title> + <meta charset=utf-8> + <meta name=description content="A bunch of stuff and things!"> + <style> + body{ + text-align:center; + overflow-y:scroll; + /*font:calc(0.75em + 1vmin) monospace;*/ + font: 1.3em monospace; + background-color: #181a1b; + border-color: #575757; + color: #e8e6e3; + } + pre{ + text-align:left; + display:inline-block; + } + img{ + max-width:57ch; + display:block; + height:auto; + width:100%; + } + a { + color: #3391ff; + } + * { + scrollbar-color: #2a2c2e #1c1e1f; + } + /*@media(prefers-color-scheme:dark){ + body{ + background:#000; + color:#fff; + } + a{ + color:#6CF; + } + }*/ + </style> + </head> + <body> + <nav> + <a href=/><b>Jojolepro</b></a> + <br/> + <a href=/blog>Blog </a> + <a href=https://github.com/jojolepro/>GitHub</a> + </nav> + <br/> + <article> + <pre> +Site index test + </pre> + </article> + <br/> + <footer> + <div> + (C) Joël Lupien 2020-2020 + </div> + <a href="/index.txt">View page source</a> + </footer> + </body> +</html> diff --git a/build/index.txt b/build/index.txt @@ -0,0 +1 @@ +Site index test diff --git a/src/blog/2020-03-31_extracting_data_from_websites/curl.png b/src/blog/2020-03-31_extracting_data_from_websites/curl.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/image_element.png b/src/blog/2020-03-31_extracting_data_from_websites/image_element.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/index.txt b/src/blog/2020-03-31_extracting_data_from_websites/index.txt @@ -0,0 +1,233 @@ +Extracting Data From A Website +================================================================================ + +Sometimes, you want to get the data or images that is visible on your screen +while visiting a websites, and its not too clear how you can get it +automatically. + +Here, I will show you a general process of doing just that. +I will first explain the important concepts and illustrate them using a real +life example. + +If you know how websites work, you can skip the next section. + +Web Pages Basics +=============================================================================== + +Let's first discuss how web pages get filled with data, being images or +text/numbers. +There is usually two locations where this happens: +On the server or on the client. + +Filled Server Side +-------------------------------------------------------------------------------- + +This is what older website usually do. +A webpage is usually composed of 4 things: +- Html (.html): The layout of the page and the text. +- Css (.css): The style (we don't care about this for now.) +- Javascript (.js): Executable code that will change the page. +- Other ressources: Images, files, etc. + +When requesting a web page, your web browser asks the web server to give it an +html file. The server takes the local html file, inserts the data into it +(being it text, numbers or paths to images) and then sends it to you. + +In this case, getting the data is easy. More on this a bit lower. + +Filled Client Side +-------------------------------------------------------------------------------- + +Now, new websites usually want to be more dynamic. That is, they want to load +content into the page without completely requesting a new html file. +They do this by using Javascript. Here's what happens: +- Your browser asks the server for the html file. +- The server gives you the html file without filling in the data. +- Your browser reads the html and executes the javascript code that it finds + (or it downloads it first, if it is on the server.) +- The javascript asks the server for the data, when it is needed. +- The server gives the data. +- The javascript inserts the data into the html. + +As you can see, it is definitely a bit more complex here. +The good news is that with a bit of intuition, we can pretty easily see what +the javascript asks to the server. In this situation, we will pretend to be the +javascript getting data. + +Identifying What We Want +================================================================================ + +No matter the way the page was made, we first need to identify what we want. +As an example to illustrate what we will be doing, I will be extracting weather +radar data from a dynamic government website. + +https://weather.gc.ca/radar/index_e.html?id=WUJ + +I bet the link will eventually cease to work, but in any case, you can follow +along with any website in the next sections in case this one doesn't work +anymore. + +./website.png + +Here, when we click the play button, the website will play an animation with +(usually) 7 pictures. A new picture is created every 10 minutes and the 7th is +deleted. +Now, what if we wanted to create an animation of the weather for the past day? +Since they don't provide an easy way to get old pictures, we will have to +download them for ourselves as time goes on. + +Let's start by getting an understanding of how those are stored in the html. +Assuming you have access to developer tools (I use Firefox as my web browser, +but it should work on Chrome and others), you can right click the picture and +click "Inspect Element". + +You should now see that there is quite a few "img" elements in there. In most +cases, the one selected is the one you want. If it is not, you can +find the one you want by right clicking on the different "src=" fields and +clicking "Open link in a new tab". +Sometimes, like in this case, the image I want is actually "deeper" in the +hierarchy. +That is, more to the right and inside another element. +In our case, it is the image inside the "div id=animation-image" element. + +./image_element.png + +Now this is weird, there is only one picture in there. If you click the play +button of the animation however, you will see that a new "div" appears right +under our image in the inspector. Let's open it! + +./preloaded.png + +Ah, here's the 7 images that play in the animator. +What's happening here is that this website only loads the images once you start +playing the animation. This is often what dynamic websites do; they will load +data (like images) only when you need to see them. + +Server-Filled +-------------------------------------------------------------------------------- +In the case of a web paged filled on the server, you will already have the +image in there as soon as the page is first loaded. If this is your case, you +can simply use something like the following to extract the image source: + +$ wget "https://blablabla" | pup 'img#the-id-of-the-image attr{src}' + +(you can find pup here: https://github.com/EricChiang/pup) + +If your web page is not like this, continue reading. + +Understanding The Pattern +================================================================================ + +Here, we see that the source (src) of each image is composed of a name, a date +and a time. +We can guess that there are two ways those names got there. Either the +javascript we are executing knows the pattern (use time increments of 10 +minutes) or, the more likely option, the server simply told us the names when +the javascript asked for them. + +We will assume that the javascript asked for those names, as it is more common. + +Identifying Our Sources +================================================================================ + +Let's switch to the "Network" tab of the developer tools and then refresh the +page. +You should see something like this: + +./network_sources.png + +Each of this is a request that our browser did to the server. That's a lot! +Let's scroll all the way up. + +./network_sources2.png + +We see that our first request is for the html file. +Then we get our css style files and also javascript files. +The css files are not interesting to us and usually the javascript files won't +be either. You can see the types of files in the "Type" column. + +We will skip all the css and js files. +Since we determined in the last section that our website is filled dynamically, +we can also ignore the image file types. +Let's look at what remains + +./network_sources3.png + +Ahhhh, now it is way easier to see what is going on. +We have our html page, some "retrieve" file and a site menu. +Let's click on the retrieve file and in the right tab, open the "Response" tab. + +./response.png + +Here! All our image sources are now available to us! +Now, let's right click the file, open in a new tab aaaaaaand... nothing. +This is something that will often happen with requests for data. The server +will usually expect some data that is not there when opening only using the +file source. + +Go back into the network tab and right click on the file. Inside of the "Copy" +section, select "Copy as cURL". We will now open a terminal and copy what we +got inside. Magically, all our data is now here! + +./curl.png + +Simplyfing Our Request +================================================================================ + +As you can see, your request is really long and complex. By trial and error, we +can remove sections of it. + +Here, the following request works: + +$ curl "https://weather.gc.ca/radar/xhr.php?action=retrieve&target=images"\ +"&region=WUJ&product=PRECIP_SNOW&lang=en-CA&format=json" \ +-H 'X-Requested-With:XMLHttpRequest' > /tmp/data.txt + +Getting Our Sources +================================================================================ + +Our query works, now it is time to get the sources for those images. +Let's use `jq`, a tool that will make our life much easier to extract the image +links from this mess. +More details on jq can be found here: https://stedolan.github.io/jq/ + +This command will extract the urls from the big text we have. +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_20.GIF +/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_23_30.GIF +... + +Since those aren't actually urls (they are missing https://....), we will add +those. + +$ cat /tmp/data.txt | jq -r '.short | map(.src) | .[]' | sed \ +'s/^/https:\/\/weather.gc.ca/' > /tmp/urls.txt + +If we look at our file, we see that everything looks good: + +$ cat /tmp/urls.txt +https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_... +https://weather.gc.ca/data/radar/temp_image//WUJ/WUJ_PRECIP_RAIN_2020_03_31_... +.... + +and finally, we can download our images! + +cd /tmp +wget -i /tmp/urls.txt + +Conclusion +================================================================================ + +While I tried to make this as simple as possible, this is not a simple topic. +It requires knowledge of the web and of the linux tools. + +This should be viewed more as an overview of the concepts than as a full +explanation of every tool we used. If this interests you, I would suggest you +go through all the tools and concepts we touched on to get and idea of what +everything is or does. Notably: +- Server sided data insertion (See PHP) +- Client sided data requests (See Javascript/jQuery) +- Linux cat/pipes/output redirection +- The jq command +- The wget and curl commands + diff --git a/src/blog/2020-03-31_extracting_data_from_websites/network_sources.png b/src/blog/2020-03-31_extracting_data_from_websites/network_sources.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/network_sources2.png b/src/blog/2020-03-31_extracting_data_from_websites/network_sources2.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/network_sources3.png b/src/blog/2020-03-31_extracting_data_from_websites/network_sources3.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/preloaded.png b/src/blog/2020-03-31_extracting_data_from_websites/preloaded.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/response.png b/src/blog/2020-03-31_extracting_data_from_websites/response.png Binary files differ. diff --git a/src/blog/2020-03-31_extracting_data_from_websites/website.png b/src/blog/2020-03-31_extracting_data_from_websites/website.png Binary files differ. diff --git a/src/index.txt b/src/index.txt @@ -0,0 +1 @@ +Site index test diff --git a/template.html b/template.html @@ -47,15 +47,13 @@ <nav> <a href=/><b>Jojolepro</b></a> <br/> + <a href=/blog>Blog </a> <a href=https://github.com/jojolepro/>GitHub</a> </nav> <br/> <article> <pre> %%CONTENT%% -Some other things. -xD -and a very shlong shlong ass line that spends the whole 80 chars I will try to. </pre> </article> <br/>