CNC Customize Parts Professional Solution & Processing Provider

How to Build a TimesMachine

by:QY Precision      2019-08-24
Earlier this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002.
Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980, comprising more than 11 million articles spread over about 2.5 million pages.
The new expansion adds 8,035 complete issues containing 1.4 million articles over 1.6 million pages.
Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this article we describe how we solved two of them.
First, we discuss the fundamental challenge of TimesMachine: efficiently serving a user a full day's newspaper scans without requiring a download of hundreds of megabytes of data.
Then we discuss an interesting string matching problem that we had to solve in order to include articles published after 1980 in TimesMachine.
Before TimesMachine launched in 2014, articles in the archive could be searched and were available to subscribers only as PDF documents.
While the archive was accessible, that implementation had two major problems: context and user experience.
Isolating an article from its surrounding content removes the context in which it was published.
A modern reader might discover that on July 20, 1969, John Fairfax became the first man to row across the Atlantic.
However, a reader of The New York Times that morning was probably more impressed by the front-page news that Apollo 11, with Neil Armstrong among its crew, had just entered orbit around the moon in preparation for the first moon landing.
Knowing where the John Fairfax article was published (the lower left corner of the front page) and what else happened that day is far more interesting and valuable to a historian than the article stripped of the day's other news.
We wanted to present the archive in all its glory, the way it was meant to be consumed on the day it was printed: one issue at a time.
Our goal was to create a fluid viewing experience rather than forcing users to wait for slow downloads of high-resolution images.
Here is how we did it.
Our digitized print archive is large, and its high-resolution page scans take up many megabytes.
The storage requirements are considerable even for a single issue.
The issue of May 22, 1927, which announced Charles Lindbergh's groundbreaking trans-Atlantic flight, consists of 226 pages and requires nearly 200 MB of storage.
When we built TimesMachine, we knew we could not expect users to download hundreds of megabytes in order to browse a single issue.
We needed a way to load only the parts of an issue that the user is looking at.
We found the answer in a somewhat unexpected quarter, and now, when you load the 200 MB Lindbergh issue in your browser, the initial page load transfers only a few megabytes.
We do this by using mapping software to display each issue.
Like a scanned newspaper layout, a digital map is just a very large image.
The most common technique for displaying digital maps (and the one we use in TimesMachine) is image tiling.
With image tiling, a large image is broken into many small square images, or "tiles," computed at various zoom levels.
Smart software running in the browser then loads only the tiles that correspond to the region of the image the user wants to see.
Many open source software libraries have been created to generate and display these tiles (we use GDAL to generate the tiles and Leaflet.js to display them).
All we had to do was adapt these libraries to newspapers.
To do this, we created a processing pipeline called the TimesMachine Publisher.
Here is how it works.
For a given issue, three inputs are required: high-resolution scans of the pages from microfilm, an XML file of article metadata, and an INI file that describes the geometric boundaries of each article on each page.
The pipeline first stitches the pages into one large virtual image.
The boundaries of the articles on each page are then projected from Cartesian (x, y) coordinates to geographic (latitude, longitude) coordinates.
These projected coordinates are combined with the article metadata to form a large JSON object that describes the contents of the complete issue.
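One way to picture the projection step is the inverse Web Mercator transform that tiled map viewers conventionally use. The sketch below is illustrative only: the function name `pixel_to_latlng`, the `world_px` parameter, and the choice of Web Mercator are all assumptions, since the article does not say which projection the pipeline applies.

```python
import math

def pixel_to_latlng(x, y, world_px):
    """Map pixel (x, y) in a square image of side world_px to (lat, lng).

    Assumes the Web Mercator convention: x grows east, y grows south,
    and the image covers the whole projected world.
    """
    lng = x / world_px * 360.0 - 180.0
    # Inverse Mercator: y = 0 is the top edge (about 85.05 degrees north).
    n = math.pi * (1.0 - 2.0 * y / world_px)
    lat = math.degrees(math.atan(math.sinh(n)))
    return lat, lng
```

Under these assumptions, the center of the stitched image maps to latitude 0, longitude 0, and each corner of an article's bounding box can be converted the same way before it is written into the issue's JSON description.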
Then the large virtual image is cut into thousands of 256x256-pixel tiles at several zoom levels.
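To get a feel for how many tiles that produces, here is a rough sketch of the tile grid at each zoom level. It assumes that each zoom level halves the resolution of the next one and that the lowest level fits in a single tile; the function name `tile_pyramid` is hypothetical, and the real pipeline delegates this work to GDAL.

```python
import math

TILE = 256  # tile edge in pixels, as stated in the article

def tile_pyramid(width, height, tile=TILE):
    """Return {zoom: (cols, rows)} for an image of the given pixel size.

    Assumption: zoom 0 is the most zoomed-out level and each higher
    zoom doubles the resolution until the full-size image is reached.
    """
    max_zoom = max(0, math.ceil(math.log2(max(width, height) / tile)))
    levels = {}
    for zoom in range(max_zoom + 1):
        scale = 2 ** (max_zoom - zoom)  # downsampling factor at this zoom
        w = math.ceil(width / scale)
        h = math.ceil(height / scale)
        levels[zoom] = (math.ceil(w / tile), math.ceil(h / tile))
    return levels
```

For a 1024x512 image this gives one tile at zoom 0, a 2x1 grid at zoom 1, and a 4x2 grid at zoom 2. A stitched newspaper issue is far larger, which is how the tile count reaches the thousands.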
All of this data is uploaded to a content delivery network (CDN).
Whenever a user requests a day's paper in TimesMachine, the client-side software downloads the JSON object that describes the contents of the paper and requests only the tiles needed to display the portion of the paper that fits the user's viewport.
Additional data is loaded only when the user pans or zooms.
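The viewport logic can be sketched in a few lines. In production this is handled by Leaflet.js in the browser; the Python below is only an illustration of the idea, and the function name `visible_tiles` and its parameters are assumptions made for the example.

```python
TILE = 256  # tile edge in pixels

def visible_tiles(left, top, width, height, cols, rows, tile=TILE):
    """List the (col, row) tiles that intersect a viewport.

    The viewport is given in pixels at the current zoom level;
    cols and rows bound the tile grid so no nonexistent tile is requested.
    """
    first_col = max(0, left // tile)
    first_row = max(0, top // tile)
    last_col = min(cols - 1, (left + width - 1) // tile)
    last_row = min(rows - 1, (top + height - 1) // tile)
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]
```

A 500x256-pixel viewport positioned at (300, 0) over a 10x10 grid touches only three of the hundred tiles, which is why the initial page load stays small.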
Using this approach, TimesMachine can serve any day's paper to users quickly and efficiently.
We ran into a fascinating problem when we tried to expand the number of issues in TimesMachine.
Initially, TimesMachine contained only articles published between 1851 and 1980.
Data after 1980 was excluded because of an interesting historical quirk in our archive.
Starting around 1981, The Times began to save the full digital text of each article it printed.
In order to extend TimesMachine past 1980 and include links to the full text, we needed to know how the scanned print archive and the digital text archive aligned.
Here is what we came up with.
The first step was to run optical character recognition (OCR) on the scanned articles in the print archive to transcribe their text as accurately as possible.
We used tesseract-ocr for this.
After doing this for every article in a day's issue, we ended up with one bucket of scanned print articles OCRed with tesseract and another bucket of articles from the full-text archive.
Then we had to figure out which articles matched between the two buckets, which was an interesting process.
Because an OCRed article rarely matches its full-text counterpart exactly, we could not align the articles by simply testing for string equality.
Instead, we used fuzzy string matching.
Our approach works on one issue at a time and relies on a technique called "shingling."
With shingling, we convert the article text in both data sets to a list of tokens and then convert the list of tokens to a list of n-token sequences called "shingles."
A sentence from Abraham Lincoln, taken from our full text, will illustrate.
We tokenize it by splitting it into a list of words separated by spaces.
The string "secret" is treated as a single token in its entirety.
Now we convert the list of tokens to a list of shingles.
If we use a shingle size of 4, we end up with a list of 5 shingles.
(The contents of the list overlap like shingles on a roof.)
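Shingling as described takes only a few lines of Python. This is a minimal sketch rather than the production code: the function name `shingle` is ours, and the Lincoln sentence is a stand-in chosen for illustration.

```python
def shingle(text, size=4):
    """Split text on whitespace and return overlapping token tuples."""
    tokens = text.split()
    # Slide a window of `size` tokens across the list, one token at a time.
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

sentence = "Four score and seven years ago our fathers"  # illustrative input
for s in shingle(sentence):
    print(" ".join(s))
```

With eight tokens and a shingle size of 4, this yields five shingles, each sharing three words with its neighbor.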
Generating a list of shingles for each article in the full-text digital archive gives us a mapping from each article to its shingles.
It is a reasonable assumption that the word sequences from an OCRed article will overlap quite a bit with the word sequences from the same article in the full-text archive.
What we want is a list of articles for each shingle, so that we can narrow down our candidates.
By iterating over the per-article lists, we can transform our data into a hash table keyed by shingle.
Now that we have mapped every shingle that appears in a given issue to all of the full-text articles that contain it, we repeat the first part of the process with the OCRed text to get a list of shingles for each OCRed article.
For example, suppose OCRed article_a is composed of shingle_2 and shingle_5.
Using the hash table, we can generate a list of candidate articles that may match article_a.
By looking up shingle_2 and shingle_5 in the table, we conclude that article_1, article_2, and article_5 are all potential matches for article_a.
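The hash table and the candidate lookup can be sketched as an inverted index. The function names and the toy data below are hypothetical; in particular, which shingle appears in which article is made up so that the result mirrors the example in the text.

```python
from collections import defaultdict

def build_index(article_shingles):
    """Map each shingle to the set of full-text article ids containing it."""
    index = defaultdict(set)
    for article_id, shingles in article_shingles.items():
        for s in shingles:
            index[s].add(article_id)
    return index

def candidates(ocr_shingles, index):
    """Full-text articles sharing at least one shingle with an OCRed article."""
    found = set()
    for s in ocr_shingles:
        found |= index.get(s, set())
    return found

# Toy data (assumed for illustration only).
full_text = {
    "article_1": ["shingle_2", "shingle_5"],
    "article_2": ["shingle_2"],
    "article_3": ["shingle_7"],
    "article_5": ["shingle_5"],
}
index = build_index(full_text)
matches = candidates(["shingle_2", "shingle_5"], index)
```

Here `matches` contains article_1, article_2, and article_5, while article_3 is never considered because it shares no shingle with article_a.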
This greatly reduces the problem space.
Instead of comparing every OCRed article in an issue with every full-text article in that issue, which could involve thousands of computationally expensive comparisons, we only need to compare each one against a short list.
This reduces the number of comparisons by several orders of magnitude.
To quantify the difference between the OCRed text and the full-text articles, we used Python's difflib library.
It gives us nice, clean results: in this particular example, the scores make it clear that OCRed article_a is most likely the same article as full-text article_1.
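The article does not say which difflib function was used. `SequenceMatcher.ratio()` is one plausible choice, so the sketch below assumes it. The helper name `best_match` and the sample strings are invented for the example.

```python
from difflib import SequenceMatcher

def best_match(ocr_text, candidate_texts):
    """Return (article_id, ratio) for the candidate most similar to ocr_text.

    candidate_texts maps article ids to full-text strings and must be non-empty.
    """
    scored = [(SequenceMatcher(None, ocr_text, text).ratio(), article_id)
              for article_id, text in candidate_texts.items()]
    ratio, article_id = max(scored)
    return article_id, ratio

# Invented sample: an OCR error ("0" for "o") against two candidates.
ocr_text = "Four sc0re and seven years ago"
candidate_texts = {
    "article_1": "Four score and seven years ago",
    "article_2": "The quick brown fox jumps over",
}
winner, score = best_match(ocr_text, candidate_texts)
```

A ratio close to 1.0 signals a near-identical pair, so a single OCR error still leaves article_1 as the clear winner over the unrelated candidate.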
Using this process, we were able to match about 80% of the articles.
For the remaining 20%, the scores were not clear enough, which required us to be clever.
In a perfect world, the relationship between our two buckets of articles would be one-to-one.
But in this world, it is actually many-to-many.
Some full-text articles are represented as multiple regions in the scanned archive, while some individual regions in the scanned archive correspond to multiple articles in the full-text archive.
We reconciled this discrepancy by dividing the data into paragraphs and running a process similar to the one described above at the paragraph level.
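A simplified picture of paragraph-level matching follows. It is a sketch under several assumptions: paragraphs are separated by blank lines, similarity is scored with `SequenceMatcher.ratio()`, and 0.8 is an arbitrary threshold. It also compares every pair directly, skipping the shingle-based candidate narrowing described above, so it would be too slow for real data.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def paragraphs(text):
    """Split text into non-empty paragraphs (blank-line separated, by assumption)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def match_paragraphs(regions, articles, threshold=0.8):
    """Link scanned regions to full-text articles through similar paragraphs.

    regions and articles map ids to text. Returns {region_id: set(article_ids)},
    so one region may point at several articles and vice versa.
    """
    links = defaultdict(set)
    for region_id, region_text in regions.items():
        for r_par in paragraphs(region_text):
            for article_id, article_text in articles.items():
                for a_par in paragraphs(article_text):
                    if SequenceMatcher(None, r_par, a_par).ratio() >= threshold:
                        links[region_id].add(article_id)
    return dict(links)
```

Because links are collected per paragraph, a scanned region that contains two short stories ends up associated with both full-text articles.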
We ended up with a near-perfect many-to-many matching between regions of the scans and the full-text archive, which makes the archive highly searchable.
You can see for yourself by browsing the entire Times archive on TimesMachine at timesmachine.nytimes.com.
Open is a blog written by New York Times developers about code and development.
We cover everything from open source projects and APIs to the technologies that power our newest products.