As most of you will know, the CEAC system was recently modified in a way that means I cannot extract the data. This is a disappointment to me personally – but also to the many people who gain understanding and insight from seeing this data. I have emailed the Department of State to request access to the data, but I don’t hold out a lot of hope. They may even remove the CAPTCHA functionality themselves, but if they don’t, we are in the dark about how many visas have been issued and so on.
I have detailed below the description of a possible solution. It will take some community spirit and assistance both to get this developed and then on an ongoing basis. If no one does any work, we will be in the dark. I am happy to help as I have the central platform that everyone can use to collaborate. I have been studying the DV program since 2013 – and I consider the CEAC data to be a critical aspect of the understanding we have of this process.
So the question is – can we do this together?
I would like to hear from programmers by email with an understanding and perhaps simple proof of concept, once the solution is agreed. The solution is a starting point – I would be happy to get input from experienced engineers if anyone feels there is a better way to do this.
The recent addition of a CAPTCHA challenge on the CEAC site has curtailed the use of the screen-scraping script that was previously in use. We are looking to reinstate the ability to extract that data by pooling community free time as a resource.
Solution – high level.
The proposal is to build a webpage (form) hosted at BritSimonSays.com where interested community members could solve CAPTCHAs and thereby facilitate the extraction of CEAC data.
The member would open the webpage and enter their Member ID (any name they want to use) and an email address. That information should be saved by a cookie to allow this to persist in the form.
The member then clicks a button to solve a CAPTCHA. Scraper code would open a session with the CEAC site and retrieve the CAPTCHA image to present in our form. The member keys in the CAPTCHA, and the scraper code would submit it along with a case number. If the solve is correct, the scraper code would then log the resulting case details to a database table.
The form would provide feedback on whether the CAPTCHA was solved correctly, and would then refresh so the member could solve another.
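To make the round trip concrete, here is a minimal sketch of the solve-and-log step. This is an illustration only, not working scraper code: `submit_to_ceac` stands in for the real Selenium interaction with the CEAC site, the case number shown is a made-up example, and all the names are hypothetical.

```python
from datetime import datetime, timezone

# In-memory stand-in for the CaseNumber database table described below.
case_table = {}

def record_solve(member_id, case_number, captcha_answer, submit_to_ceac):
    """Submit a solved CAPTCHA for one case and log the result.

    submit_to_ceac(case_number, captcha_answer) is a placeholder for the
    real scraper call; assume it returns a dict of case details when the
    CAPTCHA is correct, or None when it is wrong.
    """
    details = submit_to_ceac(case_number, captcha_answer)
    if details is None:
        return False  # wrong CAPTCHA: the form would show an error and retry
    details["LastUpdatedBy"] = member_id
    details["LastUpdatedTime"] = datetime.now(timezone.utc)
    case_table[case_number] = details
    return True

# Demonstration with a stubbed CEAC back end (hypothetical data).
def fake_ceac(case_number, answer):
    return {"Status": "Issued"} if answer == "XK7R2" else None

print(record_solve("simon", "2016EU00012345", "XK7R2", fake_ceac))  # True
```

In the real page, a `False` return would re-display a fresh CAPTCHA, while `True` would trigger the form refresh described above.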
Captchas take around 10 seconds to solve (less as people get more expert). The scraper would retrieve and log the case in around 1 second or less.
So – one member donating around 10 minutes of time would solve around 50 to 100 cases. We would need a few hundred people around the world doing that on a regular basis to keep the extracts up to date. Their efforts could be tracked and frequent solvers would earn community appreciation, respect and admiration (!) – perhaps encouraged by publishing league tables of the most frequent extractors. I might be able to get some giveaways or at the very least – easy access to me for my case advice and assistance.
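The back-of-the-envelope arithmetic checks out. Using the figures above (about 10 seconds per solve for a newcomer, about 1 second for the extract) and assuming an experienced solver gets down to roughly 5 seconds:

```python
SECONDS_DONATED = 10 * 60  # one member donating ten minutes

def cases_per_session(solve_seconds, extract_seconds=1):
    """Cases one member can clear in a ten-minute session."""
    return SECONDS_DONATED // (solve_seconds + extract_seconds)

print(cases_per_session(10))  # 54  – roughly 50 cases for a newcomer
print(cases_per_session(5))   # 100 – for a practised solver
```

That is where the "around 50 to 100 cases" per ten-minute session comes from.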
CEAC data would thus be extracted and constantly refreshed by user input. The data itself could be made available by shared spreadsheet (as I have done in the past) or perhaps be constantly available (on demand).
What would be needed to make all this happen
Obviously this solution would rely on the community to come together and have many individuals donate some of their time to get us the data we are all interested in seeing. In this case, the old saying of “many hands make light work” is especially true.
The second aspect of what would be needed is some development help. My own IT skills are mainly in other types of enterprise systems, so although I know what sort of work needs to be done, I don’t have the technical skills to do this by myself. It is not that complicated – probably a few hours’ work for a good web developer. I already have a good understanding of what needs to be done, and some code examples for how to implement the scraper. That would be based around the Selenium driver with Python or Java.
The suggestion that this should be hosted on my site is because I already get sufficient traffic to enable this process. Via my site I have a significant readership which of course builds with each year due to the CEAC data itself but also the background information and assistance I provide. I currently pay for the hosting of my site myself, but I may decide to add some advertising to this page to fund the hosting costs associated with this increased functionality.
Technical detail and requirements (plus an indication of what code I have and where I would need development help)
The main piece of functionality is the page itself. I would imagine this would be an asp page or similar, capable of running the scraper code and interacting with a SQL database.
On my hosting website I have the ability to create a MySQL database and I have a database design (ERD below) that would support the solution.
I also have code that worked until recently (written in Java), and I have some experience of using Selenium with Python, although Selenium works with Java too. The reason to suggest Selenium is that it can present the CAPTCHA image to be solved while, in the same session, passing the case number and retrieving the case details.
The form should allow multiple concurrent sessions using an algorithm to pick a case to extract once the user solves the captcha.
The form should be in some format that is deployable on my hosted website. I will maintain the complete source code for that page and all the functionality. In order to avoid having the Department of State move the goalposts again, it would be smart to not allow the scraper code to become publicly available.
A main table (CaseNumber) will store the extracted data. This will represent the latest status of each case. This is very similar to the format of the Excel spreadsheet I already produce. This is based on the data I know is available for each case, and will allow the presentation of the data to be very similar to the data people are used to seeing. There are five additional fields in that table:
- LastUpdatedBy – The member ID entered on the form for the extract/update that occurred last on that case.
- LastUpdatedTime – The time of the last update. As we move through the year (after the initial extract of ALL data), the system could select which case number to extract based on the oldest record. That way, we would be constantly directing the online effort to extracting the latest possible data.
- Current – Is the case current or not, according to the visa bulletin. This value (a Y/N boolean) will allow us to focus the extracts on current cases only.
- Extract – Should the record be extracted or not (a Y/N boolean). Again, this allows us to turn extracts on or off for certain records. For example, once a case and all derivatives are marked as issued, there is no need to continually update that case. This, in conjunction with the Current flag and the last updated time, will allow us to direct the online effort in the most efficient way.
- ExtractID – FK to the Extracts table.
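The selection logic described above – oldest eligible case first, gated by the Current and Extract flags, and safe for multiple concurrent sessions – might be sketched like this. Field names follow the proposed CaseNumber table; the rest is illustrative, not an agreed design.

```python
from datetime import datetime

def pick_next_case(cases, in_progress=frozenset()):
    """Pick the next case to refresh.

    Among cases flagged Current='Y' and Extract='Y', choose the one with
    the oldest LastUpdatedTime. `cases` is a list of dicts carrying the
    CaseNumber-table fields; `in_progress` holds case numbers already
    handed to another member's concurrent session, so two solvers are
    never given the same case.
    """
    eligible = [
        c for c in cases
        if c["Current"] == "Y" and c["Extract"] == "Y"
        and c["CaseNumber"] not in in_progress
    ]
    if not eligible:
        return None
    # Never-extracted cases (LastUpdatedTime is None) sort first.
    return min(eligible, key=lambda c: (c["LastUpdatedTime"] is not None,
                                        c["LastUpdatedTime"] or datetime.min))

cases = [
    {"CaseNumber": "2016EU00001", "Current": "Y", "Extract": "Y",
     "LastUpdatedTime": datetime(2016, 5, 1)},
    {"CaseNumber": "2016EU00002", "Current": "Y", "Extract": "N",  # issued: skip
     "LastUpdatedTime": datetime(2016, 4, 1)},
    {"CaseNumber": "2016EU00003", "Current": "Y", "Extract": "Y",
     "LastUpdatedTime": None},  # never extracted yet
]
print(pick_next_case(cases)["CaseNumber"])  # 2016EU00003
```

In production this would be a SQL query with row locking rather than an in-memory scan, but the ordering rule would be the same.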
Another table (Extracts) will store information about who extracted the data and take a snapshot of the update at that time (i.e. all the fields from the CaseNumber table apart from the final five fields). By tracking the member (and optionally getting an email address for the member) we can give some “credit” to the people doing the work. The table will also provide a history of the status of each case over time. That history will allow us to better understand the time taken to resolve AP cases in particular, and to see trends by consulate.
The Consulates table will allow us to provide a data mapping between the consulate ID provided in the CEAC file and the city/country assigned to that ID. We could therefore provide better searches of the consulates by country, or city. Furthermore I could extend this table in future to track certain data points, such as which consulates require an I-134, email or contact details for the consulates and so on.
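To make the three-table design concrete, here is a schema sketch, run against SQLite purely for illustration (the real deployment would be MySQL). Only the five tracking fields and the table relationships come from the proposal; the case-detail columns shown (Status and so on) are placeholders for whatever the CEAC spreadsheet format actually carries.

```python
import sqlite3

# Illustrative schema only: the production database would be MySQL, and
# the case-detail columns are placeholders -- the real columns would
# mirror the existing CEAC spreadsheet format.
SCHEMA = """
CREATE TABLE Consulates (
    ConsulateID TEXT PRIMARY KEY,   -- ID as it appears in the CEAC file
    City        TEXT,
    Country     TEXT
);
CREATE TABLE CaseNumber (
    CaseNumber      TEXT PRIMARY KEY,
    ConsulateID     TEXT REFERENCES Consulates(ConsulateID),
    Status          TEXT,           -- placeholder for the CEAC detail fields
    LastUpdatedBy   TEXT,           -- member ID of the last extract
    LastUpdatedTime TIMESTAMP,
    "Current"       TEXT CHECK ("Current" IN ('Y', 'N')),  -- quoted: keyword in some dialects
    "Extract"       TEXT CHECK ("Extract" IN ('Y', 'N')),
    ExtractID       INTEGER REFERENCES Extracts(ExtractID)
);
CREATE TABLE Extracts (             -- history: one row per solve
    ExtractID   INTEGER PRIMARY KEY,
    CaseNumber  TEXT REFERENCES CaseNumber(CaseNumber),
    MemberID    TEXT,
    MemberEmail TEXT,
    Status      TEXT,               -- snapshot of the case fields
    ExtractTime TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))  # ['CaseNumber', 'Consulates', 'Extracts']
```

The CaseNumber row always shows the latest status; each solve also appends a snapshot row to Extracts, which is what gives us the per-case history and the per-member credit.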
It might be necessary to conceal the IP address in some way in order to make sure the activity is not seen as a robot. Because the volume will be based on human activity, this should not actually cause a problem for the DoS servers (and therefore should not elicit any action to foil our process).
I already have code snippets for selenium and working java code for the extract. I have the hosted website, I can create the database tables, and of course I have the global reach to the many hands that will make this possible!
Timeline for this proposal.
Well, that depends on the response and assistance I receive. I know this solution could be implemented. I am looking for a developer with appropriate skills to contact me (britsimon3 at gmail) so that we can collaborate on this solution. If we work fast, we can perhaps implement this in time to restore access to CEAC data for the remainder of DV2016. Since DV2016 is in progress, there are a great number of the 55,000 cases that we can exclude from the extract (cases not current, cases already issued, and so on). So – the extract could be fairly quick (I expect it could be done over a period of a few days).
Once DV2017 starts, CEAC data will not be available for a few months (CEAC is typically loaded around January 1 of each year). By that time, the DV2017 selectees will understand the value of the CEAC data and will hopefully take some responsibility for getting the work done – for the initial extract and for the incremental extracts thereafter.
However, I can’t do this all myself – does anyone want to help?