2017-01-29

pandas.qcut can behave strangely when the column being cut contains duplicate values

The symptom is as follows.

Fill the missing values of Age in the Titanic train data with 22.0.
Running pd.qcut(train["Age"], 4, duplicates="drop").value_counts() then gives the following, where the bin sizes are heavily skewed:

(0.419, 22.0]    408
(24.0, 35.0]     220
(35.0, 80.0]     217
(22.0, 24.0]      46
Name: Age, dtype: int64

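I first assumed this was a case of "shifting the boundaries would only make the skew worse, so it can't be helped", but that is not so. For example, the following looks like a better split.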

pd.qcut(train["Age"], [0, .2, .5, .75, 1.], duplicates="drop").value_counts()

(20.0, 24.0]     275
(24.0, 35.0]     220
(35.0, 80.0]     217
(0.419, 20.0]    179
Name: Age, dtype: int64

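Is this a pandas bug? Probably not: qcut places its bin edges at the sample quantiles, and after the fill so many rows equal 22.0 that the 25% quantile is itself 22.0. Every row tied at an edge value goes into the same bin, so the first bin absorbs all of them. Still, when a column has many duplicate values it may be better to avoid qcut, or to pass explicit quantiles as in the second call.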
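The skew can be reproduced without the Titanic data. The following is a minimal sketch on made-up numbers (the `ages` series is hypothetical, not the real Age column): 40 of 100 values are tied at 22.0, so two of the quartile edges coincide and `duplicates="drop"` leaves only three bins of very different sizes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Age: 100 values, 40 of them tied at 22.0.
ages = pd.Series(
    np.concatenate([np.arange(1, 21), np.full(40, 22), np.arange(23, 63)]),
    dtype="float64",
)

# qcut derives its bin edges from the sample quantiles.
# Here both the 25% and the 50% quantile fall on the tied value 22.0.
edges = ages.quantile([0, 0.25, 0.5, 0.75, 1.0])

# duplicates="drop" merges the repeated edge, so fewer than 4 bins remain,
# and the first bin holds every row up to and including the tied value.
binned, bins = pd.qcut(ages, 4, duplicates="drop", retbins=True)
counts = binned.value_counts().sort_index()

print(edges.tolist())
print(counts)
```

Checking `ages.quantile(...)` on the real column before calling qcut is a cheap way to see in advance whether ties will distort the bins.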

2017-01-28

How to get the absolute path of a .py file

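str(Path(__file__)) raises an error when run in Jupyter Notebook, because __file__ is not defined there. os.path.abspath("__file__") runs in Jupyter Notebook as well, but note that it receives the string literal "__file__": it returns the current working directory joined with that literal name, not the path of the actual file. It is therefore only useful through os.path.dirname, which yields the working directory.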
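A small sketch of the difference. The helper name `script_dir` is hypothetical (it is not from the referenced article); it uses the real `__file__` when it exists and falls back to the working directory in environments such as Jupyter Notebook where it does not.

```python
import os
from pathlib import Path

# Passing the *string* "__file__" never fails, but it is only the working
# directory with the literal name "__file__" appended.
literal_path = os.path.abspath("__file__")
literal_dir = os.path.dirname(literal_path)


def script_dir() -> Path:
    """Hypothetical helper: directory of the running file, or the cwd in a notebook."""
    try:
        return Path(__file__).resolve().parent
    except NameError:
        # __file__ is not defined (e.g. inside Jupyter Notebook).
        return Path.cwd()


print(literal_path)
print(script_dir())
```

In a notebook the fallback assumes the kernel's working directory is the folder that holds the notebook, which is the usual default but not guaranteed.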

Reference:

qiita.com

2017-01-27

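How to import a module from a parent directory in Jupyter Notebook

Suppose the directory hierarchy is grand_parent_dir > parent_dir > child_dir, you want to import and use a function written in parent_dir/__init__.py, and the Jupyter Notebook lives under child_dir. In that case, writing the following in the notebook does not load it.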

from ..parent_dir import func

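This is because the notebook's code runs as a top-level script rather than as a module inside a package, so a relative import has no parent package to resolve against. Adding a path to sys.path as follows makes the import work, in the absolute form from parent_dir import func. One caveat: adding the path up to parent_dir is not enough; the path up to grand_parent_dir has to be added, because Python looks for the parent_dir package inside the directories listed in sys.path. Otherwise the functions in parent_dir/__init__.py could not be imported.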

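import sys, os
from pathlib import Path

# abspath("__file__") resolves the literal name against the working directory (child_dir),
# so dirname() is child_dir and .parent.parent is grand_parent_dir.
sys.path.append(str(Path(os.path.dirname(os.path.abspath("__file__"))).parent.parent))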
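To make the caveat concrete, here is a self-contained sketch that rebuilds the same layout in a temporary directory and imports from it. The temporary directory and the body of `func` are made up for illustration; only the directory names come from the description above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway copy of the layout: grand_parent_dir/parent_dir/child_dir
grand_parent_dir = Path(tempfile.mkdtemp()) / "grand_parent_dir"
child_dir = grand_parent_dir / "parent_dir" / "child_dir"
child_dir.mkdir(parents=True)
(grand_parent_dir / "parent_dir" / "__init__.py").write_text(
    "def func():\n    return 'hello from parent_dir'\n"
)

# Seen from a notebook in child_dir, two levels up is grand_parent_dir.
# That directory *contains* the parent_dir package, so it is the one to add.
sys.path.append(str(child_dir.parent.parent))
importlib.invalidate_caches()

# Absolute import, not "from ..parent_dir import func".
from parent_dir import func

result = func()
print(result)
```

Appending only `child_dir.parent` (that is, parent_dir itself) would put the contents of the package on the path, not the package, and the same import would raise ModuleNotFoundError.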

Reference:

stackoverflow.com

2017-01-26

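Automating DataFrame size reduction by specifying dtypes

In an earlier post I wrote that a DataFrame can be made smaller by specifying the dtype of each column; this post automates that.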

anton0825.hatenablog.com

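Reading a csv with the code below reduces its size automatically. Because everything is loaded first and converted afterwards, the memory used during processing goes up. A good workflow is to convert the csv to a DataFrame with this once, save it in Feather format, and use the Feather file from then on.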

from typing import List

import numpy as np
import pandas as pd

# Assumption: directory containing the input csv files (its value is not shown in this post).
INPUT_PATH: str = "."


def read_csv(filename: str, verbose: bool = True) -> pd.DataFrame:
    # Cut memory use by converting each Series to the smallest dtype that fits.
    # float16 is not used because it cannot be saved in Feather format.
    df: pd.DataFrame = pd.read_csv(f"{INPUT_PATH}/{filename}")
    start_mem: float = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        _set_optimized_type(col, df)
    end_mem: float = df.memory_usage().sum() / 1024 ** 2
    _print_result(end_mem, start_mem, verbose)
    return df


def _print_result(end_mem: float, start_mem: float, verbose: bool) -> None:
    if verbose:
        reduce_percent: float = 100 * (start_mem - end_mem) / start_mem
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, reduce_percent))


def _set_optimized_type(col: str, df: pd.DataFrame) -> None:
    # Only numeric columns are converted; other dtypes (object, bool, ...) are left as they are.
    numerics: List[str] = ['int16', 'int32', 'int64', 'float32', 'float64']
    col_type: str = str(df[col].dtype)
    if col_type in numerics:
        if col_type.startswith('int'):
            df[col] = df[col].astype(_to_optimized_int_type(df[col].min(), df[col].max()))
        else:
            df[col] = df[col].astype(_to_optimized_float_type(df[col].min(), df[col].max()))


def _to_optimized_int_type(c_min: int, c_max: int) -> type:
    # Use <= so that values exactly on a type's limit (e.g. 127 for int8) still fit.
    if np.iinfo(np.int8).min <= c_min and c_max <= np.iinfo(np.int8).max:
        return np.int8
    elif np.iinfo(np.int16).min <= c_min and c_max <= np.iinfo(np.int16).max:
        return np.int16
    elif np.iinfo(np.int32).min <= c_min and c_max <= np.iinfo(np.int32).max:
        return np.int32
    else:
        return np.int64


def _to_optimized_float_type(c_min: float, c_max: float) -> type:
    # Only the range is checked. float32 keeps about 7 significant digits, so values are rounded.
    if np.finfo(np.float32).min < c_min and c_max < np.finfo(np.float32).max:
        return np.float32
    else:
        return np.float64
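As a point of comparison, pandas has a built-in `downcast` option on `pd.to_numeric` that picks the smallest numeric dtype on its own. This is a different technique from the hand-written range checks above, shown here as a minimal sketch on a made-up DataFrame (the column names and values are hypothetical). Like the code above, it never goes below float32.

```python
import numpy as np
import pandas as pd

# Hypothetical data; pandas would normally load these as int64 / float64.
df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),        # 0..99, fits in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,   # up to 99000, needs int32
    "ratio": np.arange(100, dtype=np.float64) / 4,      # multiples of 0.25
})
before = int(df.memory_usage(index=False).sum())

for col in df.columns:
    # Choose the downcast family from the column's current dtype.
    kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
    df[col] = pd.to_numeric(df[col], downcast=kind)

after = int(df.memory_usage(index=False).sum())

print(df.dtypes)
print(before, "->", after, "bytes")
```

The hand-written version remains useful when you want control over which dtypes are allowed or want to log the reduction per file.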

Reference:

qiita.com

2017-01-25

Running a command on the server fails with "shell-init: error retrieving current directory: getcwd: cannot access parent directories"

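The cause was that the current directory no longer existed. It happened when I deleted, with WinSCP, the directory that was open in the terminal and then ran a command. Moving to a directory that exists with cd and then running the command fixed it.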
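The same situation can be reproduced from Python. This is a rough sketch that assumes a Linux-like system, where a process may delete the directory it is currently in; the temporary directory stands in for the one deleted through WinSCP.

```python
import os
import tempfile

original = os.getcwd()
doomed = tempfile.mkdtemp()  # stand-in for the directory that gets deleted

os.chdir(doomed)
error = None
try:
    os.rmdir(doomed)  # the current directory disappears underneath the process
    os.getcwd()       # asking for the current directory now fails
except OSError as exc:
    error = exc
finally:
    # The fix from the post: move to a directory that still exists.
    os.chdir(original)

print(type(error).__name__ if error else "no error", "->", os.getcwd())
```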

Reference:

arika.org

2017-01-24

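Logging in over ssh fails with "Connection to xxx closed." and "Exit status 254"

On CentOS, the following step fixes it.

In /etc/ssh/sshd_config, change UsePAM yes to UsePAM no

I don't know why this fixes it. UsePAM controls whether sshd uses Pluggable Authentication Modules (PAM). I had assumed PAM was not in use because I never configured anything for it, but perhaps something was running after all.

References: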

unix.stackexchange.com

open-groove.net

2017-01-23

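Things to watch out for when writing file paths in Python

Specifying a path like "test/data/img.jpg" makes it depend on the location of the working directory. A Python file can be run in several ways, such as

Running it from a Flask server
Running tests in the development environment
Running tests on the CI server

and the working directory often differs between them, so do the following instead.

In the __init__.py of the folder you want to treat as the root, write the following: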

from pathlib import Path

ROOT_PATH = str(Path(__file__).parent)

Other files import this file and write their paths relative to ROOT_PATH:

import os

from foo_service import ROOT_PATH
os.environ["LOGGING_CONFIG"] = f"{ROOT_PATH}/test/logging.conf"
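The effect of anchoring paths to a root can be checked with a short experiment. In this sketch a temporary directory stands in for the package root (in a real project it would be `Path(__file__).parent` in `__init__.py`, as above), and the file contents are made up.

```python
import os
import tempfile
from pathlib import Path

# Stand-in for ROOT_PATH, with a test/logging.conf file inside it.
root_path = Path(tempfile.mkdtemp())
(root_path / "test").mkdir()
(root_path / "test" / "logging.conf").write_text("[loggers]\nkeys=root\n")

relative = Path("test/logging.conf")             # depends on the working directory
anchored = root_path / "test" / "logging.conf"   # does not

original_cwd = os.getcwd()
found = {}
try:
    # Check both paths from two different working directories.
    for label, cwd in (("root", root_path), ("elsewhere", Path(tempfile.mkdtemp()))):
        os.chdir(cwd)
        found[label] = (relative.exists(), anchored.exists())
finally:
    os.chdir(original_cwd)

print(found)
```

The relative path is only found while the working directory happens to be the root, whereas the anchored path is found from both places.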

Reference:

qiita.com

日々精進

Notes on the new things I learn
