Create 2019 Indian Vidhan Sabha OCDIDs. #174
@jamesturk @jpmckinney @jdmgoogle this is necessary for the imminent Vidhan Sabha elections. I am unable to assign reviewers, so please assign yourselves.
Just want to bump this to make sure it's been seen.
Format looks good to me – I haven't checked against the source of truth.
jdmgoogle left a comment
Thanks for putting this together. I just have some questions around the script, the naming structure, and the original source of truth.
scripts/create_ocd_ids.py
Outdated
for parent in sorted(parent_set, key=lambda x: x.split(",")[-1]):
    print(parent)

for state_abbr, state in contests.items():
Please sort the output so it's easier to read.
Should we sort by state name and then district name, or by state name and then constituency?
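For illustration, one possible ordering is state, then district, then constituency. This is only a sketch: the `rows` list and the `sort_rows` helper are hypothetical names, not taken from the script under review, and the rows other than Khanapur/Sangli are sample values.

```python
# Hypothetical rows of (state, district, constituency).
# Only the Khanapur/Sangli row appears in this thread; the rest are samples.
rows = [
    ("Maharashtra", "Sangli", "Khanapur"),
    ("Haryana", "Gurugram", "Sohna"),
    ("Maharashtra", "Pune", "Kasba Peth"),
    ("Haryana", "Gurugram", "Badshahpur"),
]


def sort_rows(rows):
    """Sort by state, then district, then constituency, ignoring case."""
    return sorted(rows, key=lambda r: tuple(part.casefold() for part in r))


for state, district, constituency in sort_rows(rows):
    print(f"{state}\t{district}\t{constituency}")
```

Sorting on a tuple key keeps every constituency grouped under its district, which tends to make diffs between runs easier to review.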
scripts/create_ocd_ids.py
Outdated
# format hardcoded OCD ID
global new_file
ocd_id = "ocd-division/country:{}/state:{}/district:{}/cd:{}"
rest = "state {} district {} {} constituency {}"
The naming here is a bit awkward. Maybe
${constituency} constituency, ${district} district, ${state}
E.g.,
Khanapur constituency, Sangli district, Maharashtra
Will do. I think I'll use semicolons in lieu of the commas here, unless we should add new columns for constituency, district, and state. (Can we do that? It could potentially serve a future purpose.)
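As a rough sketch of how the suggested name format could sit in the existing `id`/`name` columns: Python's `csv.writer` quotes any field that contains the delimiter, so commas in the name may not need replacing. The `slug` and `make_row` helpers are hypothetical, and the lowercase-with-underscores normalization is an assumption rather than what the script under review does.

```python
import csv
import io

# Template copied from the diff under review.
OCD_ID_TEMPLATE = "ocd-division/country:{}/state:{}/district:{}/cd:{}"


def slug(name):
    """Assumed normalization: lowercase, whitespace runs become underscores."""
    return "_".join(name.lower().split())


def make_row(country, state_abbr, state, district, constituency):
    """Build one [id, name] row using the name format suggested in review."""
    ocd_id = OCD_ID_TEMPLATE.format(
        country, state_abbr, slug(district), slug(constituency)
    )
    name = f"{constituency} constituency, {district} district, {state}"
    return [ocd_id, name]


buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["id", "name"])
writer.writerow(make_row("in", "mh", "Maharashtra", "Sangli", "Khanapur"))
csv_text = buf.getvalue()
print(csv_text)
```

Reading `csv_text` back with `csv.reader` returns the name as a single field, commas included.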
scripts/create_ocd_ids.py
Outdated
    "Kasba Peth": "Kasbapeth"
}

const_replacements = {
Why are these being replaced?
The electoral districts in India have changed frequently in the last decade, so there's a lot of erroneous information out there. Some of it has made its way to Wikipedia, which is unfortunately the only place I'd found abbreviations. To reconcile the differences between the ultimate source of truth (https://affidavit.eci.gov.in) and the spreadsheet mapping abbreviations to districts, I use this dictionary. (Actually, this particular set of replacements is going to be taken out of this PR, as a more concrete source of constituencies has been found with all corresponding states and districts: https://electoralsearch.in)
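A minimal sketch of how such spelling differences could be surfaced automatically instead of by hand, using the standard library's `difflib`. The `find_mismatches` helper and both sample lists are hypothetical, and which spelling counts as authoritative here is an assumption for the sake of the example.

```python
import difflib


def find_mismatches(authoritative, candidates, cutoff=0.8):
    """Map each candidate missing from the authoritative list to its
    closest authoritative spelling, or None if nothing is close enough."""
    known = set(authoritative)
    suggestions = {}
    for name in candidates:
        if name in known:
            continue
        close = difflib.get_close_matches(name, authoritative, n=1, cutoff=cutoff)
        suggestions[name] = close[0] if close else None
    return suggestions


# Hypothetical inputs: names from the source of truth vs. a hand-built spreadsheet.
authoritative = ["Kasba Peth", "Khanapur"]
spreadsheet = ["Kasbapeth", "Khanapur", "Xyz"]

print(find_mismatches(authoritative, spreadsheet))
```

Names that come back as `None` would still need manual review; the cutoff of 0.8 is an arbitrary starting point.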
scripts/create_ocd_ids.py
Outdated
contests = {"hr": "Haryana", "mh": "Maharashtra"}
columns = ["id", "name"]
country = "in"
election = "Vidhan Sabha"
OCD-IDs should be independent of any one election.
Looking at the OCD-IDs for the Lok Sabha elections, it looks like the name of the election was included in the file containing the election (I was pattern matching). I'll take this out and generalize this script better.
scripts/create_ocd_ids.py
Outdated
for c_row in consts:
    # source of truth on district names:
    # https://affidavit.eci.gov.in/
What, exactly, is being pulled from there? What's the input CSV that this script is munging?
I will include another script I made that fetches data and creates constituency CSVs for each state from the new source. District and constituency information is pulled from that website. I'll detail the expected format of the district abbreviations in a comment, but that information must be retrieved from elsewhere; in this case, they were taken from Wikipedia and manually put into a spreadsheet without the use of any provided script.
While that's useful to have, I'd prefer to split that out into a separate PR and have this one focus on only the OCD-IDs.
Great. I split the PRs with this one containing the OCD IDs and another that adds the scripts that generate them.
jdmgoogle left a comment
Once the names of the constituencies are updated we should be good to go. Thanks.
Awesome, sounds good @jpmckinney. Let me know if there are any changes that need to be made to the additional OCD-IDs.
In this PR I've included a script that I created to generate OCD IDs specifically for the Indian Vidhan Sabha elections of Maharashtra and Haryana. The OCD IDs generated are for constituencies of these states, which include their districts.
There were some decisions made regarding district names due to discrepancies between the districts in Wikipedia pages for [Maharashtra districts](https://en.wikipedia.org/wiki/List_of_districts_of_Maharashtra#Districts) and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Maharashtra_Legislative_Assembly) and Haryana districts and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Haryana_Legislative_Assembly). The district pages were deferred to over the analogous columns in the constituency pages after some research. They are as follows:

- `Yamunanagar` is used over `Yamuna Nagar`
- `Gondia` is used over `Gondiya`
- `Gurugram` is used over `Gurgaon`
- `Nuh` is used over `Mewat` where applicable.

UPDATE: It was determined that https://affidavit.eci.gov.in is the source of truth regarding constituency and district names.

Additionally, no changes were made to the aliases file located in `identifiers/countries-in` as it's unclear if that was necessary.
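The district-name preferences listed in this description could be captured in a small lookup table, sketched below. The `DISTRICT_ALIASES` and `canonical_district` names are hypothetical and not from the script in this PR. The sketch also applies the `Mewat` to `Nuh` replacement unconditionally, which is a simplification of the "where applicable" caveat.

```python
# Preferred spellings from the PR description, keyed by the variant they replace.
DISTRICT_ALIASES = {
    "Yamuna Nagar": "Yamunanagar",
    "Gondiya": "Gondia",
    "Gurgaon": "Gurugram",
    "Mewat": "Nuh",  # assumption: applied everywhere, not only "where applicable"
}


def canonical_district(name):
    """Return the preferred spelling; unlisted names pass through with
    surrounding and repeated whitespace collapsed."""
    cleaned = " ".join(name.split())
    return DISTRICT_ALIASES.get(cleaned, cleaned)


for raw in ["Gurgaon", " Yamuna  Nagar ", "Sangli"]:
    print(f"{raw!r} -> {canonical_district(raw)!r}")
```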