Downloading data from SIM

In this notebook we use PySUS to download and preprocess mortality data from SIM (Sistema de Informação sobre Mortalidade), Brazil's Mortality Information System.

[1]:
from pysus.online_data import SIM, parquets_to_dataframe
from pysus.preprocessing.decoders import decodifica_idade_SIM, translate_variables_SIM
[2]:
# Download the 2019 SIM records for the state of Alagoas (AL) and load them into a DataFrame
df = parquets_to_dataframe(SIM.download('AL', 2019))
df
[2]:
ORIGEM TIPOBITO DTOBITO HORAOBITO NATURAL CODMUNNATU DTNASC IDADE SEXO RACACOR ... FONTES TPRESGINFO TPNIVELINV NUDIASINF DTCADINF MORTEPARTO DTCONCASO FONTESINF ALTCAUSA CONTADOR
0 1 2 07022019 2145 827 270910 23041944 474 1 1 ... 444
1 1 2 08022019 0600 827 270020 10011933 486 1 1 ... 445
2 1 2 27012019 1630 823 230730 10051930 488 2 1 ... 768
3 1 2 14012019 2030 827 270400 03111929 489 2 4 ... 769
4 1 2 17022019 0927 827 270000 10091935 483 1 4 ... 803
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20282 1 2 25102019 0920 08111942 476 2 4 ... 1377883
20283 1 2 26102019 1810 827 270000 07071968 451 2 4 ... 1377884
20284 1 2 26102019 1650 827 270000 15121997 421 1 4 ... 1377885
20285 1 2 25102019 1450 02041940 479 1 4 ... 1377886
20286 1 2 24102019 0300 20102019 204 2 ... SXXSXX 24122019 3 18122019 2 1377887

20287 rows × 87 columns

[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20287 entries, 0 to 20286
Data columns (total 87 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ORIGEM      20287 non-null  string
 1   TIPOBITO    20287 non-null  string
 2   DTOBITO     20287 non-null  string
 3   HORAOBITO   20287 non-null  string
 4   NATURAL     20287 non-null  string
 5   CODMUNNATU  20287 non-null  string
 6   DTNASC      20287 non-null  string
 7   IDADE       20287 non-null  string
 8   SEXO        20287 non-null  Int64
 9   RACACOR     20287 non-null  string
 10  ESTCIV      20287 non-null  string
 11  ESC         20287 non-null  string
 12  ESC2010     20287 non-null  string
 13  SERIESCFAL  20287 non-null  string
 14  OCUP        20287 non-null  string
 15  CODMUNRES   20287 non-null  Int64
 16  LOCOCOR     20287 non-null  string
 17  CODESTAB    20287 non-null  string
 18  ESTABDESCR  20287 non-null  string
 19  CODMUNOCOR  20287 non-null  string
 20  IDADEMAE    20287 non-null  string
 21  ESCMAE      20287 non-null  string
 22  ESCMAE2010  20287 non-null  string
 23  SERIESCMAE  20287 non-null  string
 24  OCUPMAE     20287 non-null  string
 25  QTDFILVIVO  20287 non-null  string
 26  QTDFILMORT  20287 non-null  string
 27  GRAVIDEZ    20287 non-null  string
 28  SEMAGESTAC  20287 non-null  string
 29  GESTACAO    20287 non-null  string
 30  PARTO       20287 non-null  string
 31  OBITOPARTO  20287 non-null  string
 32  PESO        20287 non-null  string
 33  TPMORTEOCO  20287 non-null  string
 34  OBITOGRAV   20287 non-null  string
 35  OBITOPUERP  20287 non-null  string
 36  ASSISTMED   20287 non-null  string
 37  EXAME       20287 non-null  string
 38  CIRURGIA    20287 non-null  string
 39  NECROPSIA   20287 non-null  string
 40  LINHAA      20287 non-null  string
 41  LINHAB      20287 non-null  string
 42  LINHAC      20287 non-null  string
 43  LINHAD      20287 non-null  string
 44  LINHAII     20287 non-null  string
 45  CAUSABAS    20287 non-null  string
 46  CB_PRE      20287 non-null  string
 47  COMUNSVOIM  20287 non-null  string
 48  DTATESTADO  20287 non-null  string
 49  CIRCOBITO   20287 non-null  string
 50  ACIDTRAB    20287 non-null  string
 51  FONTE       20287 non-null  string
 52  NUMEROLOTE  20287 non-null  string
 53  TPPOS       20287 non-null  string
 54  DTINVESTIG  20287 non-null  string
 55  CAUSABAS_O  20287 non-null  string
 56  DTCADASTRO  20287 non-null  string
 57  ATESTANTE   20287 non-null  string
 58  STCODIFICA  20287 non-null  string
 59  CODIFICADO  20287 non-null  string
 60  VERSAOSIST  20287 non-null  string
 61  VERSAOSCB   20287 non-null  string
 62  FONTEINV    20287 non-null  string
 63  DTRECEBIM   20287 non-null  string
 64  ATESTADO    20287 non-null  string
 65  DTRECORIGA  20287 non-null  string
 66  CAUSAMAT    20287 non-null  string
 67  ESCMAEAGR1  20287 non-null  string
 68  ESCFALAGR1  20287 non-null  string
 69  STDOEPIDEM  20287 non-null  string
 70  STDONOVA    20287 non-null  string
 71  DIFDATA     20287 non-null  string
 72  NUDIASOBCO  20287 non-null  string
 73  NUDIASOBIN  20287 non-null  string
 74  DTCADINV    20287 non-null  string
 75  TPOBITOCOR  20287 non-null  string
 76  DTCONINV    20287 non-null  string
 77  FONTES      20287 non-null  string
 78  TPRESGINFO  20287 non-null  string
 79  TPNIVELINV  20287 non-null  string
 80  NUDIASINF   20287 non-null  string
 81  DTCADINF    20287 non-null  string
 82  MORTEPARTO  20287 non-null  string
 83  DTCONCASO   20287 non-null  string
 84  FONTESINF   20287 non-null  string
 85  ALTCAUSA    20287 non-null  string
 86  CONTADOR    20287 non-null  string
dtypes: Int64(2), string(85)
memory usage: 13.5 MB
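
The summary above shows that date columns such as DTOBITO and DTNASC are stored as strings. The following is a minimal sketch of converting them to datetimes, assuming the DDMMYYYY layout suggested by the sample values (for example 07022019). The small DataFrame is a stand-in built from those sample values rather than the downloaded data, and the AGE_DAYS column name is hypothetical.

```python
import pandas as pd

# Stand-in rows copied from the sample output above (not the full download).
sample = pd.DataFrame(
    {
        "DTOBITO": ["07022019", "08022019", "24102019"],
        "DTNASC": ["23041944", "10011933", "20102019"],
    }
)

# Assumption: dates are day-month-year with no separators (DDMMYYYY).
# errors="coerce" turns empty or malformed strings into NaT instead of raising.
for col in ["DTOBITO", "DTNASC"]:
    sample[col] = pd.to_datetime(sample[col], format="%d%m%Y", errors="coerce")

# Hypothetical helper column: age at death in days, computed from the two dates.
sample["AGE_DAYS"] = (sample["DTOBITO"] - sample["DTNASC"]).dt.days
print(sample)
```

Coercing errors to NaT is a deliberate choice here, since some records in the table above have empty fields.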

Next, we decode some of the encoded variables (for example SEXO and RACACOR) into human-readable labels.

[4]:
# Replace coded values with descriptive labels and add the IDADE_ANOS column
df2 = translate_variables_SIM(df)
df2
2023-04-10 08:33:13.185 | DEBUG    | pysus.online_data.SIM:get_municipios:180 - Stablishing connection with ftp.datasus.gov.br.
220 Microsoft FTP Service
2023-04-10 08:33:13.209 | DEBUG    | pysus.online_data.SIM:get_municipios:184 - Changing FTP work dir to: /dissemin/publicos/SIM/CID10/TABELAS
2023-04-10 08:33:13.210 | INFO     | pysus.online_data.SIM:get_municipios:194 - Local parquet file found at /home/luabida/pysus/SIM_CADMUN_.parquet
/home/luabida/Projetos/InfoDengue/PySUS/pysus/preprocessing/decoders.py:122: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  return df["MUNCODDV"].append(df["MUNCOD"]).astype("int64").values
[4]:
ORIGEM TIPOBITO DTOBITO HORAOBITO NATURAL CODMUNNATU DTNASC IDADE SEXO RACACOR ... TPRESGINFO TPNIVELINV NUDIASINF DTCADINF MORTEPARTO DTCONCASO FONTESINF ALTCAUSA CONTADOR IDADE_ANOS
0 1 2 07022019 2145 827 270910 23041944 474 Masculino Branca ... 444 74.000000
1 1 2 08022019 0600 827 270020 10011933 486 Masculino Branca ... 445 86.000000
2 1 2 27012019 1630 823 230730 10051930 488 Feminino Branca ... 768 88.000000
3 1 2 14012019 2030 827 270400 03111929 489 Feminino Parda ... 769 89.000000
4 1 2 17022019 0927 827 270000 10091935 483 Masculino Parda ... 803 83.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20282 1 2 25102019 0920 08111942 476 Feminino Parda ... 1377883 76.000000
20283 1 2 26102019 1810 827 270000 07071968 451 Feminino Parda ... 1377884 51.000000
20284 1 2 26102019 1650 827 270000 15121997 421 Masculino Parda ... 1377885 21.000000
20285 1 2 25102019 1450 02041940 479 Masculino Parda ... 1377886 79.000000
20286 1 2 24102019 0300 20102019 204 Feminino NA ... 24122019 3 18122019 2 1377887 0.010959

20287 rows × 88 columns
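
The new IDADE_ANOS column expresses the coded IDADE field in years: in the output above, 474 becomes 74 years and 204 becomes 0.010959 years (4 days). The sketch below illustrates one way such a code could be decoded. It is not PySUS's implementation; the function name `idade_to_years` is hypothetical, and the unit table (leading digit 0 = minutes, 1 = hours, 2 = days, 3 = months, 4 = years, 5 = 100 years or more) is an assumption that matches the two values visible above.

```python
def idade_to_years(code: str) -> float:
    """Convert a 3-character SIM IDADE code to an approximate age in years.

    Assumed layout: first digit = unit, last two digits = quantity.
    """
    # Assumed unit table: how many of each unit make up one 365-day year.
    units_per_year = {
        "0": 365 * 24 * 60,  # minutes
        "1": 365 * 24,       # hours
        "2": 365,            # days
        "3": 12,             # months
        "4": 1,              # years
    }
    if len(code) != 3 or not code.isdigit():
        return float("nan")  # missing or malformed code
    unit, quantity = code[0], int(code[1:])
    if unit == "5":
        return 100.0 + quantity  # ages of 100 years or more
    if unit in units_per_year:
        return quantity / units_per_year[unit]
    return float("nan")  # unknown unit digit


print(idade_to_years("474"))            # 74.0
print(round(idade_to_years("204"), 6))  # 0.010959
```

This sketch does not attempt to reproduce how PySUS treats unknown or ignored ages; for real analyses, rely on the IDADE_ANOS column produced by translate_variables_SIM.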

[5]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20287 entries, 0 to 20286
Data columns (total 88 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ORIGEM      20287 non-null  string
 1   TIPOBITO    20287 non-null  string
 2   DTOBITO     20287 non-null  string
 3   HORAOBITO   20287 non-null  string
 4   NATURAL     20287 non-null  string
 5   CODMUNNATU  20287 non-null  string
 6   DTNASC      20287 non-null  string
 7   IDADE       20287 non-null  string
 8   SEXO        20287 non-null  category
 9   RACACOR     20287 non-null  category
 10  ESTCIV      20287 non-null  string
 11  ESC         20287 non-null  string
 12  ESC2010     20287 non-null  string
 13  SERIESCFAL  20287 non-null  string
 14  OCUP        20287 non-null  string
 15  CODMUNRES   20287 non-null  category
 16  LOCOCOR     20287 non-null  string
 17  CODESTAB    20287 non-null  string
 18  ESTABDESCR  20287 non-null  string
 19  CODMUNOCOR  20287 non-null  string
 20  IDADEMAE    20287 non-null  string
 21  ESCMAE      20287 non-null  string
 22  ESCMAE2010  20287 non-null  string
 23  SERIESCMAE  20287 non-null  string
 24  OCUPMAE     20287 non-null  string
 25  QTDFILVIVO  20287 non-null  string
 26  QTDFILMORT  20287 non-null  string
 27  GRAVIDEZ    20287 non-null  string
 28  SEMAGESTAC  20287 non-null  string
 29  GESTACAO    20287 non-null  string
 30  PARTO       20287 non-null  string
 31  OBITOPARTO  20287 non-null  string
 32  PESO        20287 non-null  string
 33  TPMORTEOCO  20287 non-null  string
 34  OBITOGRAV   20287 non-null  string
 35  OBITOPUERP  20287 non-null  string
 36  ASSISTMED   20287 non-null  string
 37  EXAME       20287 non-null  string
 38  CIRURGIA    20287 non-null  string
 39  NECROPSIA   20287 non-null  string
 40  LINHAA      20287 non-null  string
 41  LINHAB      20287 non-null  string
 42  LINHAC      20287 non-null  string
 43  LINHAD      20287 non-null  string
 44  LINHAII     20287 non-null  string
 45  CAUSABAS    20287 non-null  string
 46  CB_PRE      20287 non-null  string
 47  COMUNSVOIM  20287 non-null  string
 48  DTATESTADO  20287 non-null  string
 49  CIRCOBITO   20287 non-null  string
 50  ACIDTRAB    20287 non-null  string
 51  FONTE       20287 non-null  string
 52  NUMEROLOTE  20287 non-null  string
 53  TPPOS       20287 non-null  string
 54  DTINVESTIG  20287 non-null  string
 55  CAUSABAS_O  20287 non-null  string
 56  DTCADASTRO  20287 non-null  string
 57  ATESTANTE   20287 non-null  string
 58  STCODIFICA  20287 non-null  string
 59  CODIFICADO  20287 non-null  string
 60  VERSAOSIST  20287 non-null  string
 61  VERSAOSCB   20287 non-null  string
 62  FONTEINV    20287 non-null  string
 63  DTRECEBIM   20287 non-null  string
 64  ATESTADO    20287 non-null  string
 65  DTRECORIGA  20287 non-null  string
 66  CAUSAMAT    20287 non-null  string
 67  ESCMAEAGR1  20287 non-null  string
 68  ESCFALAGR1  20287 non-null  string
 69  STDOEPIDEM  20287 non-null  string
 70  STDONOVA    20287 non-null  string
 71  DIFDATA     20287 non-null  string
 72  NUDIASOBCO  20287 non-null  string
 73  NUDIASOBIN  20287 non-null  string
 74  DTCADINV    20287 non-null  string
 75  TPOBITOCOR  20287 non-null  string
 76  DTCONINV    20287 non-null  string
 77  FONTES      20287 non-null  string
 78  TPRESGINFO  20287 non-null  string
 79  TPNIVELINV  20287 non-null  string
 80  NUDIASINF   20287 non-null  string
 81  DTCADINF    20287 non-null  string
 82  MORTEPARTO  20287 non-null  string
 83  DTCONCASO   20287 non-null  string
 84  FONTESINF   20287 non-null  string
 85  ALTCAUSA    20287 non-null  string
 86  CONTADOR    20287 non-null  string
 87  IDADE_ANOS  20286 non-null  float64
dtypes: category(3), float64(1), string(84)
memory usage: 13.2 MB
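
The summary above shows that SEXO, RACACOR and CODMUNRES are now categorical, and that IDADE_ANOS has one missing value (20286 non-null out of 20287). As a hedged illustration of how such columns might be inspected, the sketch below uses a toy DataFrame in place of df2; the rows are made up for the example and are not taken from the download.

```python
import pandas as pd

# Toy stand-in for df2: a decoded categorical column and a float age column.
toy = pd.DataFrame(
    {
        "SEXO": pd.Categorical(["Masculino", "Feminino", "Feminino", "Masculino"]),
        "IDADE_ANOS": [74.0, 88.0, None, 21.0],
    }
)

# Rows where the age could not be decoded.
missing_age = toy[toy["IDADE_ANOS"].isna()]
print(len(missing_age))

# Record counts per decoded label.
print(toy["SEXO"].value_counts())

# Mean age per label; missing ages are skipped by default.
mean_age = toy.groupby("SEXO", observed=True)["IDADE_ANOS"].mean()
print(mean_age)
```

The same calls should apply to df2 itself, for instance `df2[df2["IDADE_ANOS"].isna()]` to locate the single record without a decoded age.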